HuffYUV

This page is based on the document 'Description of the HuffYUV (HFYU) Codec' by Roberto Togni found at http://multimedia.cx/huffyuv.txt.

HuffYUV is a lossless video compressor for AVI files written by Ben Rudiak-Gould. Source code is available and distributed (mostly) under the GNU General Public License. This codec was created to temporarily store video data coming from a capture card for later editing, so its main features are:

  • speed: it uses simple predictors, and the original version is heavily optimized.
  • quality: lossless, so the image is not degraded no matter how many times it is read and written.
  • editability: every frame is intra-coded (i.e., a keyframe), so there are no interframe dependencies and each frame can be seeked to independently and decoded on its own.

Tech Notes From Author's Page (referring to v 2.1.1)

HuffYUV's algorithm is roughly the same as lossless JPEG: it predicts each sample and Huffman-encodes the error. The predictor functions are:

  • "left" (predicts the previous sample from the same channel)
  • "gradient" (predicts Left+Above-AboveLeft)
  • "median" (predicts the median of Left, Above, and the gradient predictor). Channels are compressed separately, but in RGB mode the channels used are actually R-G, G, and B-G. This yields much better compression than R, G, B.

The error signal in each channel is encoded with its own Huffman table. On compression HuffYUV picks appropriate tables from its built-in collection. These tables are then stored in the output file and used when decompressing. This way future versions of HuffYUV can decompress old files without my having to explicitly support old tables. A Huffyuv-savvy application can also specify the Huffman tables to be used for compression instead of accepting the defaults.

File Header

All information about the image format and the predictor is stored in the BITMAPINFOHEADER structure of the AVI file header.

BITMAPINFOHEADER Layout

Only the interesting fields are shown here:

HUFFYUV_BITMAPINFOHEADER {
 int biSize
 [...]
 short biBitCount
 int biCompression
 [...]
 Extradata {
  BYTE method
  BYTE bpp_override
  BYTE unused[2]  
 }
 BYTE Compressed_Huffmantables[]
}

The Extradata and Compressed_Huffmantables fields are not always present.

biCompression contains the "HFYU" FOURCC.

biSize is the size of the whole structure; since the size of the standard BITMAPINFOHEADER is known, it tells whether the extra fields are present.

To find out the image format you need the true bit count: it is stored in bpp_override if biSize > sizeof(BITMAPINFOHEADER), else (or if bpp_override is 0) it is stored in biBitCount. You can then retrieve the image format from this table:

+-----+-------+
| bit | image |
+-----+-------+
| 16  | YUY2  |
| 24  | RGB   |
| 32  | RGBA  |
+-----+-------+

In the rest of this description YUV will indicate an image of type YUY2, and RGB will be used for both RGB and RGBA.

To find out the prediction method, take the 3 least significant bits of biBitCount (biBitCount & 0x07) and use them to index into this table:

+-----+-------------+
| bit |   method    |
+-----+-------------+
| 000 | see below   |
| 001 | left        |
| 010 | left decorr |
| 011 | gradient *  |
| 100 | median      |
+-----+-------------+

If the gradient method is selected (011) and the colorspace is RGB, the method is actually gradient with decorrelation.

If (biBitCount & 0x07) is 0, the method is stored in the extradata if biSize > sizeof(BITMAPINFOHEADER); otherwise the method is predict old (the method used in version 1.x, which is the same as predict left). To find the method from the extradata method byte, use this table:

+------+-----------------+
| byte |   method        |
+------+-----------------+
|  -2  | old             |
|   0  | left            |
|   1  | gradient        |
|   2  | median          |
|  64  | left decorr     |
|  65  | gradient decorr |
+------+-----------------+

Huffman tables may be stored in the file (v2.x) or not (v1.x). If biSize = sizeof(BITMAPINFOHEADER) the tables are not stored in the file, and to decode the data the decoder must know the classic tables, choosing the RGB or YUV set as appropriate. Otherwise, the tables are stored in the file: if (biBitCount & 0x07) = 0 they start at the end of BITMAPINFOHEADER, else they start after the Extradata.
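
As an illustration, the header-parsing logic described above could be sketched like this in C. The function and enum names are invented for this page, and masking off the low 3 bits of biBitCount when recovering the true bit count is an assumption (those bits may carry the prediction method); this is a sketch, not the codec's own code.

#include <stdint.h>
#include <string.h>

#define SIZEOF_BITMAPINFOHEADER 40  /* size of the standard Windows structure */
#define MKTAG(a,b,c,d) ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
                        ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

enum hfyu_method {
    M_OLD, M_LEFT, M_LEFT_DECORR, M_GRADIENT, M_GRADIENT_DECORR, M_MEDIAN
};

/* hdr points to the raw BITMAPINFOHEADER bytes, hdr_size is biSize.
 * On success fills the true bits per pixel, the prediction method and
 * whether Huffman tables are stored in the file (0 = classic tables). */
static int hfyu_parse_header(const uint8_t *hdr, uint32_t hdr_size,
                             int *bpp, enum hfyu_method *method,
                             int *tables_in_file)
{
    uint16_t biBitCount;
    uint32_t biCompression;
    memcpy(&biBitCount,    hdr + 14, 2);   /* assuming a little-endian host */
    memcpy(&biCompression, hdr + 16, 4);
    if (biCompression != MKTAG('H', 'F', 'Y', 'U'))
        return -1;

    int has_extra = hdr_size > SIZEOF_BITMAPINFOHEADER;
    const uint8_t *extra = hdr + SIZEOF_BITMAPINFOHEADER; /* method, bpp_override, unused[2] */

    /* true bit count: bpp_override wins when present and non-zero;
     * masking the low 3 bits of biBitCount is an assumption */
    *bpp = (has_extra && extra[1]) ? extra[1] : (biBitCount & ~7);

    switch (biBitCount & 0x07) {
    case 1: *method = M_LEFT;        break;
    case 2: *method = M_LEFT_DECORR; break;
    case 3: *method = (*bpp == 16) ? M_GRADIENT : M_GRADIENT_DECORR; break;
    case 4: *method = M_MEDIAN;      break;
    case 0:
        if (!has_extra) { *method = M_OLD; break; }
        switch ((int8_t)extra[0]) {
        case -2: *method = M_OLD;             break;
        case  0: *method = M_LEFT;            break;
        case  1: *method = M_GRADIENT;        break;
        case  2: *method = M_MEDIAN;          break;
        case 64: *method = M_LEFT_DECORR;     break;
        case 65: *method = M_GRADIENT_DECORR; break;
        default: return -1;
        }
        break;
    default:
        return -1;
    }

    *tables_in_file = has_extra;  /* the stored tables follow, as described above */
    return 0;
}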

Huffman Tables

Huffman tables are stored in the file compressed with a run-length algorithm. This is what Ben Rudiak-Gould says about it in the source code (tables.cpp):

// The Huffman tables are run-length encoded.  (I could have Huffman-
// encoded them, but the madness has to end somewhere.)  The
// decompression algorithm is as follows: Read a byte.  The lower five
// bits are the value to repeat.  If the upper three bits are zero, the
// repeat count is in the next byte; otherwise, the upper three bits are
// the repeat count.  The tables are zero-terminated so that I can use
// string functions on them (a zero byte can never appear in the RLE
// data).

// Each array actually contains three Huffman tables, one for each sample
// type (e.g. Y,U,V, or R-G,G,B-G).  Each table expands to 256 bytes, and
// each byte is a code length.  The codes themselves are determined at
// run time and are not stored here--except for the "classic" tables,
// which don't fit the code-choosing algorithm.

When the 3 tables are available (either extracted and decompressed, or known because the file was compressed with v1.x of the codec) the decoder can extract data from the encoded byte stream.
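
A minimal C sketch of the run-length decoder quoted above might look like this; the function name is ours, and dst_size would be 3 * 256 when the three per-channel tables are expanded in one go.

#include <stdint.h>
#include <stddef.h>

/* Run-length decoder for the stored tables, following the scheme quoted
 * above. `src` is the zero-terminated RLE data from the header, `dst`
 * receives the expanded code lengths. Returns the number of source bytes
 * consumed, or -1 if the output buffer would overflow. */
static int hfyu_unrle_tables(const uint8_t *src, uint8_t *dst, size_t dst_size)
{
    const uint8_t *p = src;
    size_t out = 0;

    while (out < dst_size) {
        uint8_t b     = *p++;
        uint8_t value = b & 0x1f;   /* lower five bits: value to repeat */
        int     count = b >> 5;     /* upper three bits: repeat count   */
        if (count == 0)
            count = *p++;           /* ...or the count is in the next byte */
        if (out + (size_t)count > dst_size)
            return -1;
        while (count--)
            dst[out++] = value;
    }
    return (int)(p - src);
}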

Tables are stored in the file in this order:

  • YUV: Y_table, U_table, V_table
  • RGB: B_table, G_table, R_table
  • RGB with decorrelation: B-G_table, G_table, R-G_table

Frame decompression

General info

One frame is stored per AVI chunk. All frames in the file are marked as keyframes. You only need data from the current chunk and the file header to decompress a frame (no knowledge of previous frames or of the frame's position in the file is needed).

Operating Algorithm

The HuffYUV codec compresses an image by predicting the value of each pixel from its neighbors and computing an error value (delta) by subtracting the prediction from the actual pixel value. The result of this operation is then compressed with a Huffman algorithm, using a different table for each channel. To decompress a frame the decoder reverses this process: extract the error value from the compressed Huffman stream, then reconstruct the pixel value by computing the predictor value and adding the error value to it. To make this possible you need a starting value (the first pixel in a frame) and predictors that use only values from pixels that have already been decompressed. All predictors used in HuffYUV satisfy this requirement.

All pixel component values must be treated as unsigned bytes. In all computations overflow or underflow causes a wrap-around (i.e., no saturation).

If decorrelation is used (RGB only), then instead of compressing the R, G, B components the codec compresses R-G, G, B-G, usually resulting in better compression. To reconstruct the B and R values you have to add to them the value of G from the same pixel.
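
In code, undoing the decorrelation for one pixel is a single wrap-around addition per channel; a small sketch (the function name is invented here):

#include <stdint.h>

/* Undo the decorrelation for one pixel: bg and rg hold the reconstructed
 * B-G and R-G values; G is added back with 8-bit wrap-around (no saturation). */
static void undo_decorrelation(uint8_t g, uint8_t bg, uint8_t rg,
                               uint8_t *b, uint8_t *r)
{
    *b = (uint8_t)(bg + g);
    *r = (uint8_t)(rg + g);
}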

Bitstream

The components of the image are compressed with different Huffman tables and then the bitstreams are interleaved.

As an example suppose there are 3 components (the values represent the difference between the pixel value and the predictor output). The byte sequence [ABC][DEF][GHI][LMN] is compressed in this way:

  • Compress each byte of the first triplet [ABC] using the Huffman table associated with its component. The encoder will output 3 variable-length codes. Suppose compressing A gives aaaa, B gives bbbbbb and C gives ccc. The output bitstream will be aaaabbbbbbccc.
  • Repeat the process for every triplet, adding the new bits at the end of the stream. If compressing [DEF] gives dddd eee fffff, the output stream after the second triplet will be aaaabbbbbbcccddddeeefffff.

To decompress this sequence, keep extracting bits from the stream until you get a valid code (the table contains all valid codes), then decode it using the table associated with that component. In the example above you will get the 4 bits aaaa and decode them to A. After decoding the current component you remove the decoded bits from the stream (the bitstream now contains bbbbbbcccddddeeefffff) and start decoding the next component. When the first triplet [ABC] is decoded the stream will be ddddeeefffff, and the decoder is ready to start again with the first component of the next pixel.
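
A structural sketch of this interleaved decoding in C could look like the following; the bit reader and the decode_symbol() helper are hypothetical placeholders for this page, not part of HuffYUV.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers: an opaque bit reader over the chunk data and a
 * routine that consumes one variable-length code from the given Huffman
 * table and returns the byte it stands for. */
typedef struct bit_reader bit_reader;
uint8_t decode_symbol(bit_reader *br, const uint8_t *table);

/* Decode the interleaved per-component deltas for a stream with 3 components.
 * The deltas are coded triplet by triplet, one code per component, each
 * component with its own table. */
static void decode_deltas(bit_reader *br,
                          const uint8_t *table1, const uint8_t *table2,
                          const uint8_t *table3,
                          uint8_t *deltas, size_t ntriplets)
{
    for (size_t i = 0; i < ntriplets; i++) {
        deltas[3 * i + 0] = decode_symbol(br, table1);  /* first component  */
        deltas[3 * i + 1] = decode_symbol(br, table2);  /* second component */
        deltas[3 * i + 2] = decode_symbol(br, table3);  /* third component  */
    }
}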

Data order

This is how data is stored in the encoded stream. In the stream format lines below:

  • a capital letter is used to indicate an uncompressed value 1 byte long
  • small letters represent Huffman-coded values

The length in bits of a Huffman-coded value depends on the size of its code word (after decompression all delta values are 1 byte long).

YUV Data

Image data is encoded and decoded left to right, top to bottom. Each pixel pair is represented by 4 bytes using YUY2 format. Frame width is always even.

Stream format: Y U Y V y u y v y u y v ....

where YUYV are the first two pixels, and y, u, v are the error values. y is decompressed with Y_table, u with U_table and v with V_table.

RGB Data

Image is stored left to right, bottom to top. Each pixel is represented by 3 bytes, using BGR24 format.

Stream format: X B G R b g r b g r ....

where X is unused, BGR is the first pixel, and b, g, r are the error values. b is decompressed with B_table, g with G_table and r with R_table.

RGB Data With Decorrelation

Image is stored left to right, bottom to top. Each pixel is represented by 3 bytes, using BGR24 format.

Stream format: X B G R g bg rg g bg rg ....

where X is unused, BGR is the first pixel, and g, bg, rg are the error values. g is decompressed with G_table, bg with B-G_table and rg with R-G_table.

RGBA Data

Image is stored left to right, bottom to top. Each pixel is represented by 4 bytes, using BGR24 format plus alpha channel.

Stream format: B G R A b g r a b g r a ....

where BGRA is the first pixel, and b, g, r, a are the error values. b is decompressed with B_table, g with G_table, r with R_table and a with R_table.

RGBA Data With Decorrelation

Image is stored left to right, bottom to top. Each pixel is represented by 4 bytes, using BGR24 format plus alpha channel.

Stream format: B G R A g bg rg a g bg rg a ....

where BGRA is the first pixel, and g, bg, rg, a are the error values. g is decompressed with G_table, bg with B-G_table, rg with R-G_table and a with R-G_table.

Predictors

Predictors are used to guess the value of the next pixel so that the codec needs to store only the error between the real pixel value and the prediction. If the prediction model is good the error will be small, and will compress better than full pixel values.

Valid predictors are:

  • YUV: old, left, gradient, median.
  • RGB: old, left, left with decorrelation, gradient with decorrelation.

Other combinations (e.g. RGB gradient without decorrelation), while technically feasible, are not used.

While describing predictors, the following symbols are used:

  • [A B C]: components of a pixel
  • Ayz: component A of pixel at row y and column z
  • a, b, c: indices used for columns
  • n, m: indices used for rows
  • x: index used to indicate the last column of a frame

Predict Left, Predict Left With Decorrelation, And Predict Old

Predict old is the same as predict left (without decorrelation in the RGB case).

The predictor for a pixel component is the same component from the previous pixel (the previous pair for U and V in the YUV case).

Some examples:

  • random pixel inside a row, RGB
 ...... [Ban Gan Ran] [Bam Gam Ram] ......

The predictor for Gam is Gan

  • random pixel inside a row, YUV
 .......[Y1an Uan Y2an Van] [Y1am Uam Y2am Vam] .......

The predictor for Uam is Uan, for Y1am is Y2an, for Y2am is Y1am

  • first pixel in a row (not for first row), RGB
[Ba1 Ga1 Ra1] ........ [Bax Gax Rax]
[Bb1 Gb1 Rb1] ......

The predictor for Rb1 is Rax

  • first pixel in a row (not for first row), YUV
[Y1a1 Ua1 Y2a1 Va1] ........ [Y1ax Uax Y2ax Vax]
[Y1b1 Ub1 Y2b1 Vb1] ......

The predictor for Ub1 is Uax, for Y2b1 is Y1b1, for Y1b1 is Y2ax

The first pixel of the picture is stored uncompressed; the rest of the frame is compressed with the predict left algorithm.
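
A minimal C sketch of the reconstruction with the left predictor for the RGB case (no decorrelation), assuming the deltas have already been extracted from the bitstream and the leading unused byte of the stream has been dropped:

#include <stdint.h>

/* Reconstruct an RGB24 frame with the left predictor. `data` holds the
 * decoded deltas in B,G,R order, with the first pixel stored raw.
 * Reconstruction is in place, with 8-bit wrap-around arithmetic. */
static void reconstruct_left_rgb(uint8_t *data, int width, int height)
{
    int n = width * height * 3;
    /* data[0..2] is the first pixel, stored uncompressed */
    for (int i = 3; i < n; i++)
        data[i] = (uint8_t)(data[i] + data[i - 3]);  /* same component, previous pixel */
}

Because the buffer is contiguous, the first pixel of a row is automatically predicted from the last pixel of the previous row, as in the examples above. The YUV case is analogous, with an offset of 2 for the Y samples and 4 for U and V; with decorrelation, G is added back to B-G and R-G afterwards.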

Predict Gradient And Predict Gradient With Decorrelation

The predictor for a pixel is given by the sum of the previous pixel (like predict left) and the above pixel (the pixel from the row above at the same column), minus the above left pixel (the pixel from the row above at the previous column).

Some examples:

  • random pixel inside a row, RGB
 ...... [Ban Gan Ran] [Bam Gam Ram] ......
 .......[Bbn Gbn Rbn] [Bbm Gbm Rbm]

The predictor for Gbm is (Gbn + Gam - Gan)

  • random pixel inside a row, YUV
 .......[Y1an Uan Y2an Van] [Y1am Uam Y2am Vam] .......
 .......[Y1bn Ubn Y2bn Vbn] [Y1bm Ubm Y2bm Vbm] .......
    • The predictor for Ubm is (Ubn + Uam - Uan)
    • The predictor for Y1bm is (Y2bn + Y1am - Y2an)
    • The predictor for Y2bm is (Y1bm + Y2am - Y1am)
  • first pixel in a row (not for first row), RGB
[Ba1 Ga1 Ra1] ........ [Bax Gax Rax]
[Bb1 Gb1 Rb1] ........ [Bbx Gbx Rbx]
[Bc1 Gc1 Rc1] ......
    • The predictor for Rc1 is (Rbx + Rb1 - Rax)
  • first pixel in a row (not for first row), YUV
[Y1a1 Ua1 Y2a1 Va1] ........ [Y1ax Uax Y2ax Vax]
[Y1b1 Ub1 Y2b1 Vb1] ........ [Y1bx Ubx Y2bx Vbx]
[Y1c1 Uc1 Y2c1 Vc1] ......
    • The predictor for Uc1 is (Ubx + Ub1 - Uax)
    • The predictor for Y2c1 is (Y1c1 + Y2b1 - Y1b1)
    • The predictor for Y1c1 is (Y2bx + Y1b1 - Y2ax)

The first pixel of the picture is stored uncompressed, and the remainder of the first row is compressed with the predict left algorithm. Other rows are compressed with predict gradient. The above-left value for the first pixel of the second row is 0.

Maybe the other pixels of the second row should also ignore the above-left value, but I am not sure about it.
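
A C sketch of the reconstruction with the gradient predictor for the RGB case, following the description above (raw first pixel, left prediction for the rest of the first row, zero above-left for the first pixel of the second row); names and layout are simplifications for this page:

#include <stdint.h>

/* Reconstruct an RGB24 frame with the gradient predictor. `data` holds the
 * decoded deltas in B,G,R order; everything uses 8-bit wrap-around. */
static void reconstruct_gradient_rgb(uint8_t *data, int width, int height)
{
    int stride = width * 3;

    /* first row: left prediction (first pixel stored uncompressed) */
    for (int i = 3; i < stride; i++)
        data[i] = (uint8_t)(data[i] + data[i - 3]);

    for (int y = 1; y < height; y++) {
        uint8_t *row   = data + y * stride;
        uint8_t *above = row - stride;
        for (int x = 0; x < stride; x++) {
            uint8_t left      = row[x - 3];   /* last pixel of the previous row when x < 3 */
            uint8_t aboveleft = (y == 1 && x < 3) ? 0 : above[x - 3];
            row[x] = (uint8_t)(row[x] + left + above[x] - aboveleft);
        }
    }
}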

Predict Median (YUV only)

The predictor for a pixel is given by the median value among the left pixel, the pixel above and the gradient predictor (sum of the previous pixel and the above pixel, minus the above-left pixel). The median is obtained by sorting the three values and taking the one in the middle; it has nothing to do with the mean value of the three numbers. For example, if the left pixel is 0x12, the above pixel is 0x4e and the gradient predictor is 0xaf, the median is 0x4e.
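
A small C sketch of such a median-of-three; with the example values above, median3(0x12, 0x4e, 0xaf) returns 0x4e.

#include <stdint.h>

/* Median of three bytes: the middle of the three values once sorted. */
static uint8_t median3(uint8_t a, uint8_t b, uint8_t c)
{
    uint8_t mx = a > b ? a : b;
    uint8_t mn = a < b ? a : b;
    if (c > mx) return mx;   /* c is the largest: the middle value is mx  */
    if (c < mn) return mn;   /* c is the smallest: the middle value is mn */
    return c;                /* otherwise c itself is the middle value    */
}

The predictor is then median3(left, above, (uint8_t)(left + above - aboveleft)), with the gradient term wrapping around as described in the Operating Algorithm section.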

As Ben Rudiak-Gould says on his page, there is no technical reason for this method to be YUV-only; he simply did not implement it for the RGB case.

Some examples:

  • random pixel inside a row, YUV
 .......[Y1an Uan Y2an Van] [Y1am Uam Y2am Vam] .......
 .......[Y1bn Ubn Y2bn Vbn] [Y1bm Ubm Y2bm Vbm] .......
    • The predictor for Ubm is the median value of Ubn, Uam and (Ubn + Uam - Uan)
    • The predictor for Y1bm is the median value of Y2bn, Y1am and (Y2bn + Y1am - Y2an)
    • The predictor for Y2bm is the median value of Y1bm, Y2am and (Y1bm + Y2am - Y1am)
  • first pixel in a row (not for first row), YUV
[Y1a1 Ua1 Y2a1 Va1] ........ [Y1ax Uax Y2ax Vax]
[Y1b1 Ub1 Y2b1 Vb1] ........ [Y1bx Ubx Y2bx Vbx]
[Y1c1 Uc1 Y2c1 Vc1] ......
    • The predictor for Uc1 is the median value of Ubx, Ub1 and (Ubx + Ub1 - Uax)
    • The predictor for Y2c1 is the median value of Y1c1, Y2b1 and (Y1c1 + Y2b1 - Y1b1)
    • The predictor for Y1c1 is the median value of Y2bx, Y1b1 and (Y2bx + Y1b1 - Y2ax)

The first pixel of the picture is stored uncompressed. The remainder of the first row, and the first 4 pixels (2 pairs, 8 bytes) of the second row, are compressed with the predict left algorithm. The rest of the second row and the following rows are compressed with predict median.

Technically the second pixel of the second row could be compressed using the median predictor; it is compressed with the left predictor because the original code uses MMX instructions and handles 8 bytes at a time.

Interlaced images

If the image height is greater than 288 pixels, the image is treated as interlaced. The decoder must build an image with double the width and half the height, obtained by placing the two fields side by side, with the odd field on the left. The frame size reported in the file header is unchanged; the only way to detect interlacing is to compare the image height with 288.

As an example, suppose your original frame is

1111111111111111
2222222222222222
3333333333333333
4444444444444444
5555555555555555
6666666666666666
7777777777777777
8888888888888888

then the frame to be compressed will be

11111111111111112222222222222222
33333333333333334444444444444444
55555555555555556666666666666666
77777777777777778888888888888888

In this way every line of a field gets its predictor values from the previous line of the same field, instead of the previous line of the frame.

Please note that this makes a difference only for prediction methods that use values from the previous line (the median and gradient algorithms): if a method refers only to the value of the previous pixel (the left algorithms), interlacing can be ignored.
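
As a sketch, reassembling the displayed frame from the side-by-side image could be done like this; it is a simplified illustration that ignores the bottom-to-top storage of RGB frames, and the names are ours.

#include <stdint.h>
#include <string.h>

/* Each line of the half-height/double-width image holds one line of the odd
 * field followed by the matching line of the even field; copy them back into
 * a normal-height frame. bytes_per_line is the line size of the displayed
 * frame, height its full height. */
static void deinterleave_fields(const uint8_t *decoded, uint8_t *frame,
                                int bytes_per_line, int height)
{
    for (int y = 0; y < height / 2; y++) {
        const uint8_t *src = decoded + (size_t)y * 2 * bytes_per_line;
        memcpy(frame + (size_t)(2 * y)     * bytes_per_line, src,                  bytes_per_line);  /* odd field  */
        memcpy(frame + (size_t)(2 * y + 1) * bytes_per_line, src + bytes_per_line, bytes_per_line);  /* even field */
    }
}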

Final notes

The frame width must be divisible by 4 (at least for the original codec).

It is likely that the header-parsing logic can be simplified a lot, since not every combination of options is valid. This is what I was able to produce using DLL versions 2.1.1, 1.2.3 and 1.3.1 and VirtualDub under Windows:

  • The old format (v1.x only) is only used with the classic tables (not stored in the file).
  • The other methods (v2.1.1) always store the tables in the file, and use the extradata to store the method.

The first two(?) pixels (or 4 bytes) of the second row in predict gradient could be stored in a different way, using only the left and above pixels as predictors. None of my samples require this.

Also found on the web: "the exact nature of the VLC bitstream: 32-bit words in little endian format, codewords placed starting from the MSB, Huffman codes allocated longest codeword first, and with a maximum codeword length of 31 bits." http://www.virtualdub.org/blog/pivot/entry.php?id=203
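
A hedged C sketch of a bit reader matching that description (32-bit little-endian words, bits taken from the most significant bit down); the names are ours and bounds checking is omitted:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint8_t *buf;   /* start of the compressed data       */
    size_t         word;  /* index of the current 32-bit word   */
    int            bit;   /* bits already consumed in that word */
} hfyu_bits;

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Peek the next n bits (1 <= n <= 32) without consuming them. */
static uint32_t peek_bits(const hfyu_bits *b, int n)
{
    uint64_t cache = ((uint64_t)read_le32(b->buf + 4 * b->word) << 32) |
                      read_le32(b->buf + 4 * (b->word + 1));
    return (uint32_t)((cache << b->bit) >> (64 - n));
}

static void skip_bits(hfyu_bits *b, int n)
{
    b->bit  += n;
    b->word += b->bit >> 5;
    b->bit  &= 31;
}

Decoding one symbol would then peek up to 31 bits, match a codeword from the table of the current component, and skip that codeword's length; the allocation order of the codewords (longest first) matters only when rebuilding the codes from the stored code lengths.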