Actimagine Video Codec
- Web page: http://www.actimagine.com
Videos in this format can be extracted with the ndstool application (a frontend for it is available at http://l33t.spod.org/ratx/DS/dslazy/ ). The video files seem to use the extension .vx and start with the signature "VXDS" in the first 4 bytes.
Some of the information here was taken from https://github.com/xoreos/xoreos/blob/master/src/video/actimagine.cpp
Header (all 32-bit little endian words):
* magic "VXDS"
* number of frames
* width
* height
* unknown
* frame rate
* audio sample rate
* number of audio streams
* max video frame size
* audio extradata offset
* video stream information offset
* number of video streams
The rest of the file consists of video+audio data packed together, the audio extradata (3124 bytes long) and the video stream information (two 32-bit words with unknown meaning plus the video data start position).
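The fixed 12-word header described above can be parsed straightforwardly. This is a sketch only; the field names are informal labels taken from the list above, not official names:

```python
import struct

def parse_vx_header(data):
    """Parse the 12-word VX header (all 32-bit little-endian words)."""
    fields = struct.unpack_from('<4s11I', data, 0)
    names = ('magic', 'num_frames', 'width', 'height', 'unknown',
             'frame_rate', 'audio_sample_rate', 'num_audio_streams',
             'max_video_frame_size', 'audio_extradata_offset',
             'video_stream_info_offset', 'num_video_streams')
    hdr = dict(zip(names, fields))
    assert hdr['magic'] == b'VXDS'
    return hdr
```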
Each frame starts with a 16-bit size and a 16-bit number of audio frames (since a single audio frame is 10-20 bytes long, several of them may be packed together). Audio data is stored immediately after the video data, aligned to 16 bits, so in order to decode the audio you need to decode the video frame first.
Video codec description
This codec is based on ITU H.264 with certain simplifications and enhancements for low-resolution coding.
General coding principles
Outside residue coding, the codec uses Elias gamma' codes in unsigned and signed form. Signed gamma' codes are mapped in the following way: 1 -> 0, 2 -> 1, 3 -> -1, 4 -> 2, and so on.
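The signed mapping above (and a reader for the underlying codes) can be sketched as follows. Note the exact bit layout of the gamma' variant is an assumption here; this sketch uses a plain MSB-first Elias gamma layout:

```python
class BitReader:
    """Minimal MSB-first bit reader (the bit order is an assumption)."""
    def __init__(self, data):
        self.data = data
        self.pos = 0  # current bit position
    def get_bit(self):
        byte = self.data[self.pos >> 3]
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit
    def get_bits(self, n):
        v = 0
        for _ in range(n):
            v = (v << 1) | self.get_bit()
        return v

def read_gamma(br):
    """Read an unsigned Elias gamma code (values start at 1)."""
    zeros = 0
    while br.get_bit() == 0:
        zeros += 1
    return (1 << zeros) | br.get_bits(zeros)

def signed_gamma(v):
    """Map an unsigned gamma value to signed: 1->0, 2->1, 3->-1, 4->2, ..."""
    if v == 1:
        return 0
    return v // 2 if v % 2 == 0 else -(v // 2)
```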
A frame is coded in 16x16 macroblocks that can be divided recursively into smaller rectangular or square sub-blocks, down to 2x2. Each block begins with an Elias gamma' code for its coding mode (modes 8, 10, 12-13 and 16-23 should be present only for 16x16/16x8/8x16/8x8 blocks):
- 0 - vertical split (e.g. 16x16 block into two 8x16 sub-blocks, not available for 2xN blocks)
- 1 - copy block from reference 1
- 2 - horizontal split (e.g. 16x4 block into two 16x2 sub-blocks, not available for Nx2 blocks)
- 3 - copy block from reference 1 and add delta value to each pixel
- 4 - copy block with MV from reference 1
- 5 - copy block with MV from reference 2
- 6 - copy block with MV from reference 3
- 7 - plane prediction with delta
- 8 - vertical split and an additional residue for the whole block is coded at the end
- 9 - copy block from reference 2
- 10 - copy block from reference 1 and add delta value to each pixel and then residue for a whole block
- 11 - full-block intra prediction (not for blocks with minimum dimension smaller than four)
- 12 - copy block from reference 1, add residue afterwards
- 13 - horizontal split plus residue
- 14 - copy block from reference 3
- 15 - intra prediction in 4x4 sub-blocks (not for blocks with minimum dimension smaller than four)
- 16 - copy block with MV from reference 1, add residue afterwards
- 17 - copy block with MV from reference 2, add residue afterwards
- 18 - copy block with MV from reference 3, add residue afterwards
- 19 - intra prediction in 4x4 subblocks, add residue afterwards
- 20 - copy block from reference 2, add residue afterwards
- 21 - copy block from reference 3, add residue afterwards
- 22 - full-block intra prediction and residue
- 23 - plane prediction with delta and residue
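The recursive splitting implied by modes 0/2 (and their residue variants 8/13) can be sketched like this. This is a hypothetical outline, not the reference implementation; `read_mode` and `leaf` stand in for the real bitstream reader and per-mode decoders, and the availability restrictions from the list above are not enforced:

```python
def decode_block(read_mode, x, y, w, h, leaf):
    """Recursively walk block splits. read_mode() yields the next block
    coding mode; leaf(mode, x, y, w, h) handles every non-split mode."""
    mode = read_mode()
    if mode in (0, 8):       # vertical split: two (w/2) x h halves
        decode_block(read_mode, x, y, w // 2, h, leaf)
        decode_block(read_mode, x + w // 2, y, w // 2, h, leaf)
    elif mode in (2, 13):    # horizontal split: two w x (h/2) halves
        decode_block(read_mode, x, y, w, h // 2, leaf)
        decode_block(read_mode, x, y + h // 2, w, h // 2, leaf)
    else:
        leaf(mode, x, y, w, h)
```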
Full-block intra prediction reads two prediction methods (one for luma, one for chroma); luma prediction methods are top, left, DC and plane, chroma prediction methods are DC, left, top and plane. For DC prediction the averages of the top and left neighbour pixels are calculated separately, and then the average of those two is used.
4x4 prediction uses a special cache of modes: the minimum of the top and left neighbour modes is used as the prediction (or 2 when they are not available). Then a bit is read; if it is one, the predicted mode is used, otherwise a three-bit value for a new prediction mode is read (and increased by one if it is larger than the predicted mode). The modes are the same as in H.264: vertical, horizontal, DC, diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left, horizontal-up (some of the modes may be implemented somewhat differently).
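The mode-prediction step can be sketched as below. One caveat: the text says the explicit mode is bumped when "larger than" the predicted mode, while H.264 bumps it when greater *or equal*; this sketch assumes the H.264 behaviour, which is an assumption:

```python
def read_intra4x4_mode(get_bit, get_bits, top_mode, left_mode):
    """Decode one 4x4 intra prediction mode. top_mode/left_mode are the
    cached neighbour modes, or None when unavailable. get_bit/get_bits
    are hypothetical bitstream callbacks."""
    if top_mode is None or left_mode is None:
        pred = 2                 # fall back to DC when a neighbour is missing
    else:
        pred = min(top_mode, left_mode)
    if get_bit():
        return pred              # predicted mode confirmed by a single bit
    mode = get_bits(3)           # explicit three-bit mode
    if mode >= pred:             # assumption: >= as in H.264 (text says ">")
        mode += 1
    return mode
```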
A special case is plane prediction. In this mode the bottom-right pixel is coded first, as an average of the top-right and bottom-left neighbour block pixels (with an additional delta for modes 7 and 23); then those three pixels are used to put an average of the top/bottom and left/right pixels halfway along the right and bottom edges, then a centre pixel is interpolated from these, and then the block is divided into quarters and the process of reconstructing all but the bottom-right pixel continues recursively. For 1xN and Nx1 blocks the centre pixel is calculated as an average and then the block is recursively divided into two parts.
Motion compensation is performed in full-pel mode and the source area must lie fully within the reference frame. Any of the three previously decoded frames may be used as a reference at any point.
There are three coding modes: motion compensation with no motion vector coded (the predicted one is used), motion compensation with a coded motion vector, and motion compensation with an absolute motion vector and a pixel difference.
For the first two modes a motion vector is predicted per macroblock as the median of the top, left and top-left macroblock MVs. When a macroblock contains several sub-blocks with motion vectors, the last one should be saved. The last motion compensation mode does not use the predicted MV and does not store its motion vector either; it also always uses the previous frame as the reference.
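The median prediction described above, applied component-wise to (x, y) vectors, is simply:

```python
def predict_mv(top, left, topleft):
    """Predict a macroblock MV as the component-wise median of the top,
    left and top-left macroblock motion vectors (each an (x, y) pair)."""
    def median3(a, b, c):
        return sorted((a, b, c))[1]
    return (median3(top[0], left[0], topleft[0]),
            median3(top[1], left[1], topleft[1]))
```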
Residue is coded in groups of 8x8 semi-macroblocks made of 4x4 blocks. First there is a CBP, remapped from a gamma' code using the following table (the top bit is for the chroma blocks):
0x00, 0x08, 0x04, 0x02, 0x01, 0x1F, 0x0F, 0x0A,
0x05, 0x0C, 0x03, 0x10, 0x0E, 0x0D, 0x0B, 0x07,
0x09, 0x06, 0x1E, 0x1B, 0x1A, 0x1D, 0x17, 0x15,
0x18, 0x12, 0x11, 0x1C, 0x14, 0x13, 0x16, 0x19
Then there are coefficients for the 4x4 blocks.
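The remap is a plain table lookup. Indexing the table by the gamma' value minus one (gamma codes start at 1) is an assumption:

```python
# CBP remap table from above; low bits select 4x4 blocks, top bit is chroma.
CBP_REMAP = [
    0x00, 0x08, 0x04, 0x02, 0x01, 0x1F, 0x0F, 0x0A,
    0x05, 0x0C, 0x03, 0x10, 0x0E, 0x0D, 0x0B, 0x07,
    0x09, 0x06, 0x1E, 0x1B, 0x1A, 0x1D, 0x17, 0x15,
    0x18, 0x12, 0x11, 0x1C, 0x14, 0x13, 0x16, 0x19,
]

def decode_cbp(gamma_value):
    """Map an unsigned gamma' code (values start at 1) to a CBP."""
    return CBP_REMAP[gamma_value - 1]
```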
Coefficients are coded per 4x4 block. First a context-dependent static Huffman codebook codes the mode, which tells how many non-zero coefficients there are and how many of them form a tail of plus/minus ones. Then a context-dependent code tells how many zeroes are at the end of the block. For the tail of ones only signs are coded; the remaining coefficients are coded as (gamma() << level) | get_bits(level) followed by a sign, where level is initially zero and is increased by one whenever the absolute coefficient value is greater than the limit for the current level. Between coefficients a zero-run is coded with a context-dependent code (the context in this case being the number of possible remaining zeroes).
The context for the whole block is the average of the number of coded coefficients in the top and left 4x4 blocks. For chroma there is a single context stored for both components, computed as the average of the coded coefficients in the U and V blocks.
The limits for each level are: 2, 5, 11, 23, 47, 32768.
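The adaptive escape scheme for the non-tail coefficients can be sketched as below; `read_gamma`, `get_bits` and `get_sign` are hypothetical bitstream callbacks standing in for the real reader:

```python
LEVEL_LIMITS = [2, 5, 11, 23, 47, 32768]

def decode_levels(read_gamma, get_bits, get_sign, count):
    """Decode `count` coefficient levels: value = (gamma << level) | extra
    bits, then a sign; `level` grows once |value| exceeds the limit for
    the current level, so later coefficients get wider escape codes."""
    level = 0
    out = []
    for _ in range(count):
        val = (read_gamma() << level) | get_bits(level)
        val = -val if get_sign() else val
        out.append(val)
        if abs(val) > LEVEL_LIMITS[level]:
            level += 1
    return out
```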
Audio codec description
Audio is some variant of LPC with 128-sample frames coded in 20/14/12/10 bytes and codebook data (3124 bytes) stored in the container.
* first LPC filter codebook (64 sets of order-eight 16-bit little-endian coefficients)
* second LPC filter codebook (64 sets of order-eight 16-bit little-endian coefficients)
* third LPC filter codebook (64 sets of order-eight 16-bit little-endian coefficients)
* scale modifiers (eight 16-bit entries)
* base LPC filter (eight 32-bit values)
* starting scale (32-bit)
Codec data format
Codec data is stored in 16-bit little-endian words: the first two words provide fixed frame information and the remaining three to eight words contain pulse information. Fields are packed MSB first.
First word:
* 7 bits - previous frame offset (0x7F - intra frame, 0x7E - zeroes, other values - copy data from 127-offset)
* 3 bits - scale modifier index
* 6 bits - first LPC codebook index
Second word:
* 2 bits - pulse start position
* 2 bits - pulse packing mode
* 6 bits - second LPC codebook index
* 6 bits - third LPC codebook index
The LPC filter is coded as a difference from the previous one (for intra frames the base LPC filter serves as the previous one), expressed as the sum of one entry from each of the three LPC filter codebooks. Intra frames use the new filter for the whole frame; inter frames interpolate between the old and new filter over quarters of the frame.
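The filter update can be sketched as below. The reconstruction follows the text directly; the linear interpolation weights per quarter-frame are an assumption, since the text does not specify them:

```python
def reconstruct_lpc(prev, cb1, cb2, cb3, i1, i2, i3):
    """New filter = previous filter plus one order-eight entry from each
    of the three codebooks. For intra frames `prev` is the base filter
    stored in the container extradata."""
    return [p + a + b + c for p, a, b, c in
            zip(prev, cb1[i1], cb2[i2], cb3[i3])]

def interp_lpc(old, new, quarter):
    """Blend old and new filters for quarter-frame 0..3 of an inter frame
    (hypothetical linear weights)."""
    w = (quarter + 1) / 4
    return [(1 - w) * o + w * n for o, n in zip(old, new)]
```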
Pulses are coded depending on mode:
- mode 0 takes eight words and places pulses at a distance of three. First, five three-bit values from each word are read and used as (val * 2 - 7) * scale for the pulse values. Then the low bits of those words are composed into another word and two more three-bit values are extracted;
- mode 1 takes five words, extracts eight pulses from each (using (val * 2 - 3) * scale for the pulse values) and places the pulses at a distance of three from each other;
- mode 2 is like mode 1 but it takes four words and places pulses at distance four from each other;
- mode 3 is like mode 1 but it takes three words and places pulses at distance five from each other.
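As an illustration, mode 1 can be sketched as follows (modes 2 and 3 differ only in word count and pulse spacing). The MSB-first packing of the 2-bit fields within each word is an assumption:

```python
def decode_pulses_mode1(words, start, scale, frame_len=128):
    """Mode 1 sketch: five 16-bit words, each holding eight 2-bit pulse
    values; pulse value = (v * 2 - 3) * scale, pulses spaced three
    samples apart starting at `start` (the 2-bit pulse start position)."""
    frame = [0] * frame_len
    pos = start
    for w in words:
        for i in range(8):
            v = (w >> (14 - 2 * i)) & 3   # assumed MSB-first 2-bit fields
            frame[pos] = (v * 2 - 3) * scale
            pos += 3
    return frame
```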