Sorenson Video 1: Difference between revisions

From MultimediaWiki
Jump to navigation Jump to search
(Add to video codecs category.)
(Add technical description from multimedia.cx.)
Line 2: Line 2:
* Company: [[Sorenson]]
* Company: [[Sorenson]]
* Patents: US #5,844,612, "Motion vector quantizing selection system"
* Patents: US #5,844,612, "Motion vector quantizing selection system"
* Technical Description: [http://multimedia.cx/svq1-format.txt http://multimedia.cx/svq1-format.txt]


SVQ1 presumably stands for Sorenson Vector Quantizer #1 codec. It is also commonly known as Sorenson Video 1. The codec uses mean removal and hierarchical adaptive multistage vector quantization for intracoded compression and mean removal combined with motion compensation for interframe compression.
''This document is based on the document 'Description of the Sorenson Vector Quantizer #1 (SVQ1) Video Codec by Mike Melanson and Ewald Snel found at [http://multimedia.cx/svq1-format.txt http://multimedia.cx/svq1-format.txt].''
 
 
== Introduction ==
 
Sorenson Vector Quantizer #1 is identified by the four-character code
SVQ1. This FOURCC will be used throughout this document to identify the
codec.
 
SVQ1 is primarily used to encode video in Quicktime files. The format
was commonly seen from 1998-2002 before its successor, SVQ3 (which is
based on completely different coding concepts), was rolled out. The
algorithm uses hierarchical adaptive multistage vector quantization with
mean removal and motion compensation in order to achieve its
compression.
 
The SVQ1 patent listed in the references is a good source for
understanding both the encoding and decoding process. This document
focuses on principles necessary to implement a SVQ1 decoder.
 
This document assumes familiarity with certain general and
video-specific coding concepts such as Huffman (a.k.a. variable length)
coding, YUV colorspaces, inter/intraframe coding, and motion
compensation. A familiarity with vector quantization is useful, but this
document will explain what you need to know about the particular flavor
of VQ that SVQ1 uses.
 
 
== Algorithm Basics ==
 
SVQ1 operates on vectors (think blocks) of pixels. At the highest level,
the algorithm breaks an image into 16x16 blocks of pixels, or 256-pixel
vectors. For each vector, the encoder performs a motion search (if the
frame is not a keyframe), mean removal, and multi-stage codebook search
through a series of pre-defined codebooks. If the encoder finds a good
match for the vector after one or more of these operations, then it
outputs the proper bit coding for that vector. Otherwise, the 16x16
block is broken in half into 2 16x8 blocks (128-pixel vectors) and the
same operation is repeated (minus the motion search, which only
operates on the 16x16 vector level). SVQ1 operates at 6 block sizes:
 
  16x16
  16x8
  8x8
  8x4
  4x4
  4x2
 
If the encoder has to break a 4x4 block into 2 4x2 blocks, the block
cannot be subdivided further and must be encoded as is using mean
removal and a combination of vector codebooks.
 
SVQ1 uses YUV 4:1:0 planar data as its native colorspace. This is also
known as YUV9. Each 4x4 block of pixels is represented by one Y sample
per pixel, but only 1 U sample and 1 V sample for the entire block.
 
The SVQ1 coding method embodies 3 types of frames: Intraframes, or
I-frames, which are coded completely by themselves with no data
dependencies from other frames; predicted frames, or P-frames, which are
predicted from the previous I- or P-frame and thus have a data
dependency on the previous I- or P-frame; and droppable P-frames, or
B-frames, which are predicted from either the previous I- or P-frame and
thus have a data dependency. However, the latter frame type is similar
to MPEG B-frames in the sense that other frames cannot predict from
them, so it is safe to drop them when decoding is behind schedule. SVQ1
B-frames are dissimilar to MPEG B-frames in the sense that they are
unidirectional and not bidirectional.
 
 
== Stream Format And Header ==
 
A single SVQ1 chunk has the following high-level organization:
 
  header  |  Y plane data  |  U plane data  |  V plane data
 
Each SVQ1 chunk begins with a header which can be a variable number of
bits in length. The header bits have the following meaning:
 
  frame code          22 bits
  if (bits 6-5 of frame code are both 0 (i.e., frame code & 0x60 == 0))
    frame is invalid
 
  temporal reference  8 bits
  picture type        2 bits
 
  if (picture type is I-frame)
    if (frame code is 0x50 or 0x60)
      checksum        16 bits
    if (frame code ^ 0x10 > 0x50)
      embedded string  n bits
    unknown            2 bits
    unknown            2 bits
    unknown            1 bit
    frame size code    3 bits
    if (frame size code is 7)
      frame width    12 bits
      frame height    12 bits
 
  checksum present    1 bit
  if (checksum present)
    use packet chksum  1 bit
    component chksums  1 bit
      after image data
    unknown_field_1    2 bits
    if (unknown_field_1 is not 0)
      invalid frame header
 
  unknown_flag_1      1 bit
  if (unknown_flag_1)
    unknown            1 bit
    unknown            4 bits
    unknown            1 bit
    unknown            2 bits
    while (next bit is 1)
      read a byte from the stream
 
True, many of the header fields are unknown. But the stream can be decoded
just fine with the known header fields. The picture type has the following
4 possible values:
 
  0  I-frame
  1  P-frame
  2  B-frame
  3  invalid frame code, do not proceed with frame decode
 
If the frame size code is 7, 12 bits are stored in the header for each of
the width and the height. If the frame size code is any of the other 7
possible values, the frame size is one of these common frame dimension
pairs:
 
code  width  height
  0    160    120
  1    128      96
  2    176    144
  3    352    288
  4    704    576
  5    240    180
  6    320    240
 
A SVQ1 chunk can have a message string embedded inside the chunk header.
If (frame code ^ 0x10 > 0x50), follow this algorithm to decoder the
string:
 
  string[0] = next 8 bits
  xor_byte = string_xor_table[string[0]]
  foreach i from 1..string[0]-1
    next_byte = next 8 bits
    string[i] = (next_byte) ^ xor_byte
    xor_byte = string_xor_table[next_byte]
 
Note that since string[0] holds the string length and is a byte, the
maximum string length is 255 characters.
 
The specific details of the checksum coding are not all known.
 
 
== Decoding Intraframe Plane Data ==
 
After the frame header is decoded, the bitstream contains all of the
encoded data to render the Y plane, then the data for the U plane, then
the V plane.
 
For the purposes of this description, each block size is assigned a level:
 
  level 5  16x16
  level 4  16x8
  level 3  8x8
  level 2  8x4
  level 1  4x4
  level 0  4x2
 
For each of the 3 planes in an intraframe, iterate through each 16x16
block, which are each level 5 blocks, from left->right, top->bottom.
 
processing a level 5 block:
  get next bit in stream
  if (bit is 1)
    subdivide level 5 block into 2 level 4 blocks
    insert top level 4 block into processing queue
    insert bottom level 4 block into processing queue
  else
    process block, which entails:
    fetch VLC from bitstream using the appropriate intra vector table;
      this specifies the number of VQ stages
    if stages is -1, skip this block
    if (stages > 0) && (level >= 4)
      invalid vector, error out of decode since levels 4 and 5 blocks do
      not use multistage VQ
    fetch VLC from bitstream using the intra mean table; this specifies
      the mean removed from the vector during encoding
    if (stages == 0)
      set each sample in block to the mean value
    else
      fetch the next (stages * 4) bits from the bitstream; these specify
        the codebooks to use in reconstructing the image; the first 4 bits
        specify the first codebook to use, the next 4 bits specify the
        second codebook, all the way up to a sixth codebooks for 6
        possible stages
      add the mean to the each value in the block
      for each stage
        add the appropriate codebook to the block
 
If the level 5 block was subdivided into 2 level 4 halves, process each
half using the following algorithm:
 
processing a level 4 block:
  get next bit in stream
  if (bit is 1)
    subdivide level 4 block into 2 level 3 blocks
    insert left level 3 block into processing queue
    insert right level 3 block into processing queue
  else
    process block the same way that a full level 5 block is processed, but
      use the appropriate level 4 VLC tables and codebooks
 
If any level 3 blocks were inserted into the processing queue as a
result of level 4 block subdivisions, process each level 3 block in
order using this algorithm:
 
processing a level 3 block:
  ...
 
By now, you should be recognizing a pattern, e.g.:
 
process_block(level)
  if (next_bit == 1)
    subdivide block into 2 (level - 1) halves
    insert first half into processing queue
    insert second half into processing queue
  else
    ...process block...
 
Where does it end? The function requires a base case where the
processing can go no further. In this algorithm, blocks cannot be
subdivided beyond level 0 (4x2 samples):
 
process_block(level)
  if (level > 0) && (next_bit == 1)
    subdivide block into 2 (level - 1) halves
    insert first half into processing queue
    insert second half into processing queue
  else
    ...process block...
 
Thus, for each block, the entire block is processed as a level 5 block
and possibly subdivided into top and bottom halves for processing after
the level 5 block is processed. Suppose a level 5 block is subdivided
and that the resulting top level 4 block is further subdivided while the
bottom level 4 block is processed by itself. The processing can be
represented as a tree:
 
      --- 5 ---
      |      |
    -- 4 --    4
    |    |
    3    3
 
The SVQ1 algorithm processes each level of this tree in order from 5
down to 0. In the above example, the decoder would unpack the
information for the level 5 block, then each level 4 block, then each
level 3 block, before moving onto the next level 5 block. This is known
as a breadth-first tree traversal, as opposed to a depth-first traversal
which would process each of the 2 halves of a block subdivision before
moving onto the next block in the same level. The following abstract
information would be used to represent the level 5 block in the above
example:
 
  1  (subdivide level 5 block)
  1  (subdivide top level 4 block)
  0x (0 = no subdivision, x = removed mean and multistages for bottom
      level 4 block)
  0x (0 = no subdivision, x = removed mean and multistages for left
      level 3 block)
  0x (0 = no subdivision, x = removed mean and multistages for right
      level 3 block)
 
Note that if the processing tree subdivides to level 0, there will be no
0 in front of the mean and multistage information for the block since
the block can not be subdivided further.
 
As an example of the processing steps needed to reconstruct an
intraframe block, consider an example where the decoder has recursed
down to the base case of level 0. Fetch a VLC from the bitstream using
the intra multistage VQ table. For example, the retrieved VLC is 2 (from
a range of -1..6). This indicates mean removal plus 2 stages of VQ.
Extract a VLC from the bitstream using the intra mean removal VLC table.
For this example, the removed mean is 61. Extract 8 more bits from the
bitstream, 4 for each of the VQ stages. The first 4 bits indicate the
stage 1 codebook and the second 4 bits indicate the stage 2 codebook.
For this example, stage 1 = 4 and stage 2 = 14.
 
  base intra vector =
    0    0    0    0
    0    0    0    0
 
  removed mean = 61
 
  stage 1, codebook #4 vector for level 0 vectors:
    7  -16  -10  20
    7  -17  -10  20
 
  stage 2, codebook #14 vector for level 0 vectors:
  -13  -6  -1  -4
  25  37  -2  -35
 
First, add the mean to each element of the base vector:
 
  61  61  61  61
  61  61  61  61
 
Then, apply the stage 1 codebook:
 
  68  45  51  81
  68  44  51  81
 
Then, apply the stage 2 codebook:
 
  55  39  50  77
  93  81  49  46
 
This is the final 4x2 block of samples that will go into the output.
 
A note about wraparound/clipping/clamping/saturation, etc.: At each
stage of the addition or subtraction, the result of saturated to an
unsigned byte range, 0..255.
 
 
== Decoding Interframe Plane Data ==
 
Interframe decoding allows the SVQ1 encoder to use existing blocks of data
from the previous frame as references to encode current blocks (motion
compensation), encode blocks in the same manner as in an intraframe, or
just skip blocks altogether if they remain unchanged from the previous
frame. Interframes are either P- or B- frames and use data from the
previous I- or P-frame, but never the previous B-frame.
 
For each of the 3 planes in an interframe, iterate through each level 5
block, from left->right, top->bottom. Each level 5 block has one of the
following coding modes:
 
  0  SKIP: the block remains unchanged from the previous I- or P- frame
  1  INTER: perform the INTRA process using a motion block from the
    previous frame
  2  INTER_4MV: same as INTER but use 4 motion vectors instead of 1
  3  INTRA: the block is decoded using the same process described in the
    previous section
 
SVQ1 motion compensation operates in two modes. One mode operates on an
entire 16x16 block (INTER mode). The second mode splits a block into
four 8x8 subblocks and each subblock has a motion vector (INTER_4MV
mode). The motion vectors always operate on a half-pixel grid, similar
to MPEG-2.
 
A motion vector is predicted from the previous left, top, and top-right
motion vectors, if the vectors are available and coded within the
current frame. A vector of (0,0) is used in place of any unavailable
vectors. The predicting x motion component is the median value of the x
components of the 3 candidate vectors. The predicting y motion component
is computed the same way using the 3 candidate y components. In the case
that only one vector is available, that vector becomes the predicting
vector. Of course, if no vectors are available, then the predicting
vector is (0,0) which will be the same as (x,y) once the motion
components are decoded.
 
For example, if this is the first block of a slice which is not 0, then
the previous left vector, pl(x,y), is (0,0). Assume that the previous
top (pt) and top-right (ptr) vectors are both coded within the current
frame. pt(x,y) = (5,17) and ptr(x,y) = (-9,12):
 
  for x, MEDIAN(0,5,-9) = 0
  for y, MEDIAN(0,17,12) = 12
 
Thus, the predicting motion vector is (0,12). From here, the x motion
component for this block is decoded from the stream and added to 0,
followed by the y component which is added to 12. (0+x,12+y), clipped to
a range of -32..31, is the final motion vector.
 
An individual motion vector component is coded as a VLC magnitude
followed by a sign bit. If the next bit of the bitstream is 1 then the
motion component is 0. Overall, the algorithm for motion vector
component extraction resembles the following:
 
  if (peek_bit(bitstream) == 1)
    motion_component = 0
    skip_bit(bitstream)
  else
    motion_component = get_vlc(bitstream, motion_component_vlc_table)
    if (get_bit(bitstream) == 1)
      motion_component = -motion_component
 
Each 8x8 block in a frame corresponds to a motion vector. When a motion
vector is decoded for an INTER mode block, the motion vectors for the 4
constituent 8x8 blocks in the 16x16 block are all set to that same
motion vector. If a 16x16 block is decoded with either the INTRA or SKIP
mode, the motion vectors for its 4 8x8 blocks are reset to 0, which is
equivalent to indicating that the blocks were not coded in the current
frame. In the case that a 16x16 block is decoded using the INTER_4MV
mode, each of the 8x8 blocks ends up with a different motion vector.
 
If the current block is coded as INTER and only uses one motion vector
for the entire 16x16 block, the motion vector is predicted from the
previous left 8x8 block, and the previous top and top-right motion
vectors for 16x16 blocks. To clarify, consider the following example in
which each character in the grid represents a 8x8 block:
 
  A B C D E
  G H I J K
  M N 1 2 Q
  S T 3 4 W
 
    [1 2]
The [3 4] 16x16 block will use block N as previous left, block C as
previous top, and block E as previous top-right.
 
In the case of INTER_4MV mode, each of the 4 8x8 blocks gets a different
motion vector. Follow this table as an example:
 
    Predicted Motion Vector
          L    T  T-R
        +--------------
      1|  N    I    K
Block 2|  1    J    K
      3|  T    1    2
      4|  3    2    1
 
For block 3, the top-left motion vector is used rather than the
top-right motion vector.
 
Note that the motion vector for block 1 should be predicted and its
differential motion components decoded from the stream, applied, and the
resulting motion vector saved, all before decoding the motion vector for
block 2. The 4 motion vectors need to be fully decoded in series since
they depend on each other.
 
Once the motion vector is fully decoded and the reference 16x16 block is
copied from the appropriate block from the previous frame to the current
block in the current frame, repeat the same familiar intraframe decoding
process. The only difference is that you will add the codebook vectors
and removed mean to the interframe block rather than to a block of zero
samples.
 
 
== Appendix A: SVQ1 Data Tables ==
 
This is a listing of all of the data tables necessary to encode/decode
SVQ1 data.
 
VLC tables:
  interframe block coding types: This table is used for decoding the
    coding mode for each level 5 block in an interframe.
 
  motion vector component table: This table is used for extracting
    differential motion component magnitudes from the bitstream. The
    values from this table range from 1..32 (0 is a special case
    separate from the table).
 
  intraframe mean removal tables:
  interframe mean removal tables: These tables contain removed mean
    values. The intraframe mean removal table ranges from 0..255 while
    the interframe table ranges from -256..255.
 
  intraframe multistage VQ tables, one for each block level:
  interframe multistage VQ tables, one for each block level: These
    tables describe how a block is coded. The values in these tables
    range from -1..6:
    -1: skip block
    0: block is coded as mean-only (no multistage VQ)
    1..6: block is coded as mean plus 1..6 stages of VQ
 
Codebooks:
  intraframe vector codebooks, one for each level 0..3, 16 for each of 6
    stages:
  interframe vector codebooks, one for each level 0..3, 16 for each of 6
    stages:
  These tables contain a large number of signed bytes that are applied
    to vector levels to reconstruct vectors for a final output frame.
 
Other:
  checksum table: This table contains values for calculating checksums
    over the encoded and decoded SVQ1 data.
 
  string XOR table: This table contains logical XOR values used to
    obfuscate strings embedded within SVQ1 frame headers.
 
All of these data tables can be found in the CVS source repository for
the ffmpeg project:
http://mplayerhq.hu/cgi-bin/cvsweb.cgi/~checkout~/ffmpeg/libavcodec/svq1_cb.h?content-type=text/x-cvsweb-markup&cvsroot=FFMpeg
http://mplayerhq.hu/cgi-bin/cvsweb.cgi/~checkout~/ffmpeg/libavcodec/svq1_vlc.h?content-type=text/x-cvsweb-markup&cvsroot=FFMpeg
http://mplayerhq.hu/cgi-bin/cvsweb.cgi/~checkout~/ffmpeg/libavcodec/svq1.c?content-type=text/x-cvsweb-markup&cvsroot=FFMpeg
 
 
== References ==
 
Motion vector quantizing selection system
U.S. Patent #5,844,612
Inventors: Israelsen; Paul D. (North Logan, UT) 
Assignee: Utah State University Foundation (North Logan, UT) 
 
Quicktime.qts mega-DLL:
http://www.apple.com/quicktime/


SVQ1 data is transported within [[Apple QuickTime]] files. Apparently, the codec made its first appearance in Apple's QuickTime version 3. However, the codec quickly gained fame in early 1999 when Apple released version 4 of their QuickTime player and made movie trailers for [http://www.imdb.com/title/tt0120915/ Star Wars Episode I: The Phantom Menace] available for download. The quality of the codec greatly surpassed [[Cinepak]] which had been widely used in the downloadable multimedia domain for a long time.


[[Category:Video Codecs]]
[[Category:Video Codecs]]

Revision as of 08:22, 18 January 2006

  • FOURCCs: svq1, SVQ1, svqi
  • Company: Sorenson
  • Patents: US #5,844,612, "Motion vector quantizing selection system"

This document is based on the document 'Description of the Sorenson Vector Quantizer #1 (SVQ1) Video Codec by Mike Melanson and Ewald Snel found at http://multimedia.cx/svq1-format.txt.


Introduction

Sorenson Vector Quantizer #1 is identified by the four-character code SVQ1. This FOURCC will be used throughout this document to identify the codec.

SVQ1 is primarily used to encode video in Quicktime files. The format was commonly seen from 1998-2002 before its successor, SVQ3 (which is based on completely different coding concepts), was rolled out. The algorithm uses hierarchical adaptive multistage vector quantization with mean removal and motion compensation in order to achieve its compression.

The SVQ1 patent listed in the references is a good source for understanding both the encoding and decoding process. This document focuses on principles necessary to implement a SVQ1 decoder.

This document assumes familiarity with certain general and video-specific coding concepts such as Huffman (a.k.a. variable length) coding, YUV colorspaces, inter/intraframe coding, and motion compensation. A familiarity with vector quantization is useful, but this document will explain what you need to know about the particular flavor of VQ that SVQ1 uses.


Algorithm Basics

SVQ1 operates on vectors (think blocks) of pixels. At the highest level, the algorithm breaks an image into 16x16 blocks of pixels, or 256-pixel vectors. For each vector, the encoder performs a motion search (if the frame is not a keyframe), mean removal, and multi-stage codebook search through a series of pre-defined codebooks. If the encoder finds a good match for the vector after one or more of these operations, then it outputs the proper bit coding for that vector. Otherwise, the 16x16 block is broken in half into 2 16x8 blocks (128-pixel vectors) and the same operation is repeated (minus the motion search, which only operates on the 16x16 vector level). SVQ1 operates at 6 block sizes:

 16x16
 16x8
 8x8
 8x4
 4x4
 4x2

If the encoder has to break a 4x4 block into 2 4x2 blocks, the block cannot be subdivided further and must be encoded as is using mean removal and a combination of vector codebooks.

SVQ1 uses YUV 4:1:0 planar data as its native colorspace. This is also known as YUV9. Each 4x4 block of pixels is represented by one Y sample per pixel, but only 1 U sample and 1 V sample for the entire block.

The SVQ1 coding method embodies 3 types of frames: Intraframes, or I-frames, which are coded completely by themselves with no data dependencies from other frames; predicted frames, or P-frames, which are predicted from the previous I- or P-frame and thus have a data dependency on the previous I- or P-frame; and droppable P-frames, or B-frames, which are predicted from either the previous I- or P-frame and thus have a data dependency. However, the latter frame type is similar to MPEG B-frames in the sense that other frames cannot predict from them, so it is safe to drop them when decoding is behind schedule. SVQ1 B-frames are dissimilar to MPEG B-frames in the sense that they are unidirectional and not bidirectional.


Stream Format And Header

A single SVQ1 chunk has the following high-level organization:

 header  |  Y plane data  |  U plane data  |  V plane data

Each SVQ1 chunk begins with a header which can be a variable number of bits in length. The header bits have the following meaning:

 frame code          22 bits
 if (bits 6-5 of frame code are both 0 (i.e., frame code & 0x60 == 0))
   frame is invalid
 temporal reference   8 bits
 picture type         2 bits
 if (picture type is I-frame)
   if (frame code is 0x50 or 0x60)
     checksum        16 bits
   if (frame code ^ 0x10 > 0x50)
     embedded string  n bits
   unknown            2 bits
   unknown            2 bits
   unknown            1 bit
   frame size code    3 bits
   if (frame size code is 7)
     frame width     12 bits
     frame height    12 bits
 checksum present     1 bit
 if (checksum present)
   use packet chksum  1 bit
   component chksums  1 bit
     after image data
   unknown_field_1    2 bits
   if (unknown_field_1 is not 0)
     invalid frame header
 unknown_flag_1       1 bit
 if (unknown_flag_1)
   unknown            1 bit
   unknown            4 bits
   unknown            1 bit
   unknown            2 bits
   while (next bit is 1)
     read a byte from the stream

True, many of the header fields are unknown. But the stream can be decoded just fine with the known header fields. The picture type has the following 4 possible values:

 0  I-frame
 1  P-frame
 2  B-frame
 3  invalid frame code, do not proceed with frame decode

If the frame size code is 7, 12 bits are stored in the header for each of the width and the height. If the frame size code is any of the other 7 possible values, the frame size is one of these common frame dimension pairs:

code  width  height
  0    160     120
  1    128      96
  2    176     144
  3    352     288
  4    704     576
  5    240     180
  6    320     240

A SVQ1 chunk can have a message string embedded inside the chunk header. If (frame code ^ 0x10 > 0x50), follow this algorithm to decoder the string:

 string[0] = next 8 bits
 xor_byte = string_xor_table[string[0]]
 foreach i from 1..string[0]-1
   next_byte = next 8 bits
   string[i] = (next_byte) ^ xor_byte
   xor_byte = string_xor_table[next_byte]

Note that since string[0] holds the string length and is a byte, the maximum string length is 255 characters.

The specific details of the checksum coding are not all known.


Decoding Intraframe Plane Data

After the frame header is decoded, the bitstream contains all of the encoded data to render the Y plane, then the data for the U plane, then the V plane.

For the purposes of this description, each block size is assigned a level:

 level 5  16x16
 level 4  16x8
 level 3  8x8
 level 2  8x4
 level 1  4x4
 level 0  4x2

For each of the 3 planes in an intraframe, iterate through each 16x16 block, which are each level 5 blocks, from left->right, top->bottom.

processing a level 5 block:

 get next bit in stream
 if (bit is 1)
   subdivide level 5 block into 2 level 4 blocks
   insert top level 4 block into processing queue
   insert bottom level 4 block into processing queue
 else
   process block, which entails:
   fetch VLC from bitstream using the appropriate intra vector table;
     this specifies the number of VQ stages
   if stages is -1, skip this block
   if (stages > 0) && (level >= 4)
     invalid vector, error out of decode since levels 4 and 5 blocks do
     not use multistage VQ
   fetch VLC from bitstream using the intra mean table; this specifies 
     the mean removed from the vector during encoding
   if (stages == 0)
     set each sample in block to the mean value
   else
     fetch the next (stages * 4) bits from the bitstream; these specify 
       the codebooks to use in reconstructing the image; the first 4 bits 
       specify the first codebook to use, the next 4 bits specify the 
       second codebook, all the way up to a sixth codebooks for 6 
       possible stages
     add the mean to the each value in the block
     for each stage
       add the appropriate codebook to the block

If the level 5 block was subdivided into 2 level 4 halves, process each half using the following algorithm:

processing a level 4 block:

 get next bit in stream
 if (bit is 1)
   subdivide level 4 block into 2 level 3 blocks
   insert left level 3 block into processing queue
   insert right level 3 block into processing queue
 else
   process block the same way that a full level 5 block is processed, but 
     use the appropriate level 4 VLC tables and codebooks

If any level 3 blocks were inserted into the processing queue as a result of level 4 block subdivisions, process each level 3 block in order using this algorithm:

processing a level 3 block:

 ...

By now, you should be recognizing a pattern, e.g.:

process_block(level)

 if (next_bit == 1)
   subdivide block into 2 (level - 1) halves
   insert first half into processing queue
   insert second half into processing queue
 else
   ...process block...

Where does it end? The function requires a base case where the processing can go no further. In this algorithm, blocks cannot be subdivided beyond level 0 (4x2 samples):

process_block(level)

 if (level > 0) && (next_bit == 1)
   subdivide block into 2 (level - 1) halves
   insert first half into processing queue
   insert second half into processing queue
 else
   ...process block...

Thus, for each block, the entire block is processed as a level 5 block and possibly subdivided into top and bottom halves for processing after the level 5 block is processed. Suppose a level 5 block is subdivided and that the resulting top level 4 block is further subdivided while the bottom level 4 block is processed by itself. The processing can be represented as a tree:

      --- 5 ---
      |       |
   -- 4 --    4
   |     |
   3     3

The SVQ1 algorithm processes each level of this tree in order from 5 down to 0. In the above example, the decoder would unpack the information for the level 5 block, then each level 4 block, then each level 3 block, before moving onto the next level 5 block. This is known as a breadth-first tree traversal, as opposed to a depth-first traversal which would process each of the 2 halves of a block subdivision before moving onto the next block in the same level. The following abstract information would be used to represent the level 5 block in the above example:

 1  (subdivide level 5 block)
 1  (subdivide top level 4 block)
 0x (0 = no subdivision, x = removed mean and multistages for bottom
     level 4 block)
 0x (0 = no subdivision, x = removed mean and multistages for left
     level 3 block)
 0x (0 = no subdivision, x = removed mean and multistages for right
     level 3 block)

Note that if the processing tree subdivides to level 0, there will be no 0 in front of the mean and multistage information for the block since the block can not be subdivided further.

As an example of the processing steps needed to reconstruct an intraframe block, consider an example where the decoder has recursed down to the base case of level 0. Fetch a VLC from the bitstream using the intra multistage VQ table. For example, the retrieved VLC is 2 (from a range of -1..6). This indicates mean removal plus 2 stages of VQ. Extract a VLC from the bitstream using the intra mean removal VLC table. For this example, the removed mean is 61. Extract 8 more bits from the bitstream, 4 for each of the VQ stages. The first 4 bits indicate the stage 1 codebook and the second 4 bits indicate the stage 2 codebook. For this example, stage 1 = 4 and stage 2 = 14.

 base intra vector = 
   0    0    0    0
   0    0    0    0
 removed mean = 61
 stage 1, codebook #4 vector for level 0 vectors:
   7  -16  -10   20
   7  -17  -10   20
 stage 2, codebook #14 vector for level 0 vectors:
 -13   -6   -1   -4
  25   37   -2  -35

First, add the mean to each element of the base vector:

  61   61   61   61
  61   61   61   61

Then, apply the stage 1 codebook:

  68   45   51   81
  68   44   51   81

Then, apply the stage 2 codebook:

  55   39   50   77
  93   81   49   46

This is the final 4x2 block of samples that will go into the output.

A note about wraparound/clipping/clamping/saturation, etc.: At each stage of the addition or subtraction, the result of saturated to an unsigned byte range, 0..255.


Decoding Interframe Plane Data

Interframe decoding allows the SVQ1 encoder to use existing blocks of data from the previous frame as references to encode current blocks (motion compensation), encode blocks in the same manner as in an intraframe, or just skip blocks altogether if they remain unchanged from the previous frame. Interframes are either P- or B- frames and use data from the previous I- or P-frame, but never the previous B-frame.

For each of the 3 planes in an interframe, iterate through each level 5 block, from left->right, top->bottom. Each level 5 block has one of the following coding modes:

 0  SKIP: the block remains unchanged from the previous I- or P- frame
 1  INTER: perform the INTRA process using a motion block from the 
    previous frame
 2  INTER_4MV: same as INTER but use 4 motion vectors instead of 1
 3  INTRA: the block is decoded using the same process described in the 
    previous section

SVQ1 motion compensation operates in two modes. One mode operates on an entire 16x16 block (INTER mode). The second mode splits a block into four 8x8 subblocks and each subblock has a motion vector (INTER_4MV mode). The motion vectors always operate on a half-pixel grid, similar to MPEG-2.

A motion vector is predicted from the previous left, top, and top-right motion vectors, if the vectors are available and coded within the current frame. A vector of (0,0) is used in place of any unavailable vectors. The predicting x motion component is the median value of the x components of the 3 candidate vectors. The predicting y motion component is computed the same way using the 3 candidate y components. In the case that only one vector is available, that vector becomes the predicting vector. Of course, if no vectors are available, then the predicting vector is (0,0) which will be the same as (x,y) once the motion components are decoded.

For example, if this is the first block of a slice which is not 0, then the previous left vector, pl(x,y), is (0,0). Assume that the previous top (pt) and top-right (ptr) vectors are both coded within the current frame. pt(x,y) = (5,17) and ptr(x,y) = (-9,12):

 for x, MEDIAN(0,5,-9) = 0
 for y, MEDIAN(0,17,12) = 12

Thus, the predicting motion vector is (0,12). From here, the x motion component for this block is decoded from the stream and added to 0, followed by the y component which is added to 12. (0+x,12+y), clipped to a range of -32..31, is the final motion vector.

An individual motion vector component is coded as a VLC magnitude followed by a sign bit. If the next bit of the bitstream is 1 then the motion component is 0. Overall, the algorithm for motion vector component extraction resembles the following:

 if (peek_bit(bitstream) == 1)
   motion_component = 0
   skip_bit(bitstream)
 else
   motion_component = get_vlc(bitstream, motion_component_vlc_table)
   if (get_bit(bitstream) == 1)
     motion_component = -motion_component

Each 8x8 block in a frame corresponds to a motion vector. When a motion vector is decoded for an INTER mode block, the motion vectors for the 4 constituent 8x8 blocks in the 16x16 block are all set to that same motion vector. If a 16x16 block is decoded with either the INTRA or SKIP mode, the motion vectors for its 4 8x8 blocks are reset to 0, which is equivalent to indicating that the blocks were not coded in the current frame. In the case that a 16x16 block is decoded using the INTER_4MV mode, each of the 8x8 blocks ends up with a different motion vector.

If the current block is coded as INTER and only uses one motion vector for the entire 16x16 block, the motion vector is predicted from the previous left 8x8 block, and the previous top and top-right motion vectors for 16x16 blocks. To clarify, consider the following example in which each character in the grid represents a 8x8 block:

 A B C D E
 G H I J K
 M N 1 2 Q
 S T 3 4 W
   [1 2]

The [3 4] 16x16 block will use block N as previous left, block C as previous top, and block E as previous top-right.

In the case of INTER_4MV mode, each of the 4 8x8 blocks gets a different motion vector. Follow this table as an example:

    Predicted Motion Vector
          L    T   T-R
       +--------------
      1|  N    I    K
Block 2|  1    J    K
      3|  T    1    2
      4|  3    2    1

For block 3, the top-left motion vector is used rather than the top-right motion vector.

Note that the motion vector for block 1 should be predicted and its differential motion components decoded from the stream, applied, and the resulting motion vector saved, all before decoding the motion vector for block 2. The 4 motion vectors need to be fully decoded in series since they depend on each other.

Once the motion vector is fully decoded and the reference 16x16 block is copied from the appropriate block from the previous frame to the current block in the current frame, repeat the same familiar intraframe decoding process. The only difference is that you will add the codebook vectors and removed mean to the interframe block rather than to a block of zero samples.


Appendix A: SVQ1 Data Tables

This is a listing of all of the data tables necessary to encode/decode SVQ1 data.

VLC tables:

 interframe block coding types: This table is used for decoding the
   coding mode for each level 5 block in an interframe.
 motion vector component table: This table is used for extracting
   differential motion component magnitudes from the bitstream. The
   values from this table range from 1..32 (0 is a special case
   separate from the table).
 intraframe mean removal tables:
 interframe mean removal tables: These tables contain removed mean
   values. The intraframe mean removal table ranges from 0..255 while
   the interframe table ranges from -256..255.
 intraframe multistage VQ tables, one for each block level:
 interframe multistage VQ tables, one for each block level: These
   tables describe how a block is coded. The values in these tables
   range from -1..6:
   -1: skip block
    0: block is coded as mean-only (no multistage VQ)
   1..6: block is coded as mean plus 1..6 stages of VQ

Codebooks:

 intraframe vector codebooks, one for each level 0..3, 16 for each of 6 
   stages:
 interframe vector codebooks, one for each level 0..3, 16 for each of 6
   stages:
 These tables contain a large number of signed bytes that are applied
   to vector levels to reconstruct vectors for a final output frame.

Other:

 checksum table: This table contains values for calculating checksums 
   over the encoded and decoded SVQ1 data.
 string XOR table: This table contains logical XOR values used to
   obfuscate strings embedded within SVQ1 frame headers.

All of these data tables can be found in the CVS source repository for the ffmpeg project: http://mplayerhq.hu/cgi-bin/cvsweb.cgi/~checkout~/ffmpeg/libavcodec/svq1_cb.h?content-type=text/x-cvsweb-markup&cvsroot=FFMpeg http://mplayerhq.hu/cgi-bin/cvsweb.cgi/~checkout~/ffmpeg/libavcodec/svq1_vlc.h?content-type=text/x-cvsweb-markup&cvsroot=FFMpeg http://mplayerhq.hu/cgi-bin/cvsweb.cgi/~checkout~/ffmpeg/libavcodec/svq1.c?content-type=text/x-cvsweb-markup&cvsroot=FFMpeg


References

Motion vector quantizing selection system U.S. Patent #5,844,612 Inventors: Israelsen; Paul D. (North Logan, UT) Assignee: Utah State University Foundation (North Logan, UT)

Quicktime.qts mega-DLL: http://www.apple.com/quicktime/