QuickTime container

From MultimediaWiki
Revision as of 17:29, 29 January 2006 by Multimedia Mike (talk | contribs) (added 2 more video codecs)
  • Extensions: mov, qt, mp4, m4v, m4a, m4p
  • Company: Apple

Known FOURCCs

The following sections list FOURCCs known to appear in Apple QuickTime files. Note that sometimes the FOURCC is only 3 characters and there is a space (ASCII 0x20) to round out the full 4 characters.

Video FOURCCs

Audio FOURCCs

Microsoft ID FOURCCs

These FOURCCs indicate that the audio information stsd atom also transports a Microsoft-style WAVEFORMATEX header.

Technical Description

The Apple Quicktime file format is an extremely well-defined file format. A little too well-defined, in fact. Some would even call it "over-engineered". The official Quicktime documentation is a magnificently detailed beast that gives equal time to explaining all parts of the spec, no matter how important or ignored a particular component may be in the actual implementation. The official spec can be a lot to digest at once and this document is intended to help interested programmers come up to speed on the Quicktime internals much more quickly.

This document emphasizes the components of the Quicktime file format that a programmer would need to know in order to write a general purpose Quicktime file decoder. This document also contains a discussion of decoding strategies.

Note that this document will probably never be complete since there is so much flexibility in the Quicktime format. But it is designed to cover the majority of QT files ever produced.

Byte Ordering

The first important fact to know about Quicktime files when writing a decoder is that all multi-byte numbers are big endian owing to Apple's Motorola heritage.
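As a small illustration of the byte ordering, the following sketch reads big-endian integers with Python's standard struct module. The helper names read_u32_be and read_u64_be are made up for this example.

```python
import struct

def read_u32_be(buf, offset=0):
    """Read a 32-bit unsigned big-endian integer from buf at offset."""
    return struct.unpack_from(">I", buf, offset)[0]

def read_u64_be(buf, offset=0):
    """Read a 64-bit unsigned big-endian integer from buf at offset."""
    return struct.unpack_from(">Q", buf, offset)[0]
```

The ">" prefix in the format string selects big-endian order regardless of the host CPU.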

Atoms: The Fundamental Quicktime Building Blocks

Apple's Quicktime designers were thinking differently when they came up with the notion of an "atom" as "something that can contain other atoms". Atoms are the chunks of data that comprise a Quicktime file. Sometimes they contain data and sometimes they contain other atoms.

An atom consists of a size, a type, and a data payload. An atom is laid out as follows:

bytes 0-3    atom size (including 8-byte size and type preamble)
bytes 4-7    atom type
bytes 8..n   data

The 4 bytes allotted for the atom size field limit the maximum size of an atom to 4 GB. Quicktime also has a provision to allow atoms with 64-bit atom size fields by setting the size field to 1 and adding the 8-byte size field after the atom type:

bytes 0-3    always 0x00000001
bytes 4-7    atom type
bytes 8-15   atom size (including 16-byte size and type preamble)
bytes 16..n  data

This is a logical exception since an atom always needs to be at least 8 bytes in length to account for the preamble. Therefore, if the size field is 1, load the 64-bit atom size from just after the atom type field.
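The two layouts above can be handled by one small routine. This is a minimal sketch, not a complete parser; the function name read_atom_header and its return tuple are choices made for this example.

```python
import struct

def read_atom_header(buf, offset=0):
    """Return (atom_type, total_size, header_size) for the atom at offset.

    total_size includes the preamble, as in the layouts above.
    """
    size, atom_type = struct.unpack_from(">I4s", buf, offset)
    header_size = 8
    if size == 1:
        # A size of 1 signals that the real 64-bit size follows the type.
        size = struct.unpack_from(">Q", buf, offset + 8)[0]
        header_size = 16
    return atom_type, size, header_size
```

The payload of the atom then spans offset + header_size up to offset + total_size.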

General File Organization

In the abstract atom hierarchy, this is how a Quicktime file is laid out:

moov
  mvhd
  trak
    tkhd
    edts
      elst
    mdia
      mdhd
      minf
        stbl
          stsd
          stco
          co64
          stts
          stss
          stsc
          stsz
  trak
  trak
  ..
mdat
  [data]
  [data]
  [...]

Note that this is not an exhaustive tree of all possible or known atoms; these are only the atoms that have been empirically determined as "interesting" for the purposes of writing a general-purpose decoder that can handle most Quicktime files.

All Quicktime files need to have a moov atom and a mdat atom at the top level. There are other top level atoms as well, which generally are not interesting and can safely be skipped if encountered. The moov atom contains instructions for playing the data in the file. The mdat atom contains the data that will be played.

mdat, moov, trak, edts, mdia, minf, and stbl atoms

All of these atoms share a section in this document since there is nothing particularly complicated about any of them. For the most part, they all serve as containers for other atoms.

The mdat atom contains the media data in a Quicktime file. The mdat atom consists of the usual size and type atom preamble followed by all of the file's media data squished together with no extra markers. The file's moov atom must be parsed in order to determine where individual data chunks begin and end.

The moov atom contains many sub-atoms that specify how a movie is to be played.

The trak atom contains all of the information for a new media track in a QT file and contains many sub-atoms.

The edts atom contains the elst atom.

The mdia atom contains mdhd and minf atoms.

The minf atom contains an stbl atom.

The stbl atom contains the stsd, stco/co64, stsc, stts, stss, and stsz atoms.
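Since these atoms are mostly containers, a decoder can discover the hierarchy with a short recursive walk. The sketch below assumes the whole file (or at least the moov atom) is already in memory as a bytes object; the names walk_atoms and CONTAINERS are invented for this example. It simply stops at anything that looks malformed, including a size field of 0, which this sketch does not try to interpret.

```python
import struct

# Atom types that this document describes as containers of other atoms.
CONTAINERS = {b"moov", b"trak", b"edts", b"mdia", b"minf", b"stbl"}

def walk_atoms(buf, start=0, end=None, depth=0):
    """Yield (depth, atom_type, payload_offset, payload_size) for each atom."""
    if end is None:
        end = len(buf)
    pos = start
    while pos + 8 <= end:
        size, atom_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:
            if pos + 16 > end:
                break  # truncated 64-bit size field
            size = struct.unpack_from(">Q", buf, pos + 8)[0]
            header = 16
        if size < header or pos + size > end:
            break  # malformed or truncated atom; give up on this level
        yield depth, atom_type, pos + header, size - header
        if atom_type in CONTAINERS:
            # Descend into the payload, which is itself a run of atoms.
            yield from walk_atoms(buf, pos + header, pos + size, depth + 1)
        pos += size
```

Printing "  " * depth followed by the atom type for each result gives a tree much like the one in the General File Organization section.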

tkhd atom

The tkhd atom is the track header.

This may be useful for obtaining the width and height of a video trak.

elst atom

The elst atom contains the edit list. The edit list contains information about the times and durations that pieces of a media track are to be presented during playback. There are many Quicktime file decoders that choose to ignore this atom. This is not a good idea. The edit list atom must be taken into account to guarantee proper A/V sync on certain files.

An edit list atom has the following structure:

bytes 0-3    atom size (including atom preamble)
bytes 4-7    atom type: 'elst'
bytes 8-11   Quicktime version information
bytes 12-15  number of edit list entries
bytes 16..   edit list entries

An individual edit list entry is 12 bytes in size and has the following structure:

bytes 0-3    edit duration (in global timescale units)
bytes 4-7    edit media time (in trak timescale units)
bytes 8-11   playback speed
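The layout above can be read with a few struct calls. This is a sketch under two assumptions that are not stated in this document: the media time is treated as a signed value (Apple's documentation uses -1 to mark an empty edit), and the playback speed is treated as a 16.16 fixed-point number. The function name parse_elst is made up for this example.

```python
import struct

def parse_elst(atom):
    """Parse a whole 'elst' atom (preamble included) into a list of edits.

    Each edit is (duration, media_time, speed).
    """
    _size, atom_type, _version, count = struct.unpack_from(">I4sII", atom, 0)
    if atom_type != b"elst":
        raise ValueError("not an elst atom")
    edits = []
    for i in range(count):
        # Assumption: media time is signed, speed is 16.16 fixed point.
        duration, media_time, speed = struct.unpack_from(">IiI", atom, 16 + 12 * i)
        edits.append((duration, media_time, speed / 65536.0))
    return edits
```

Remember that the duration and the media time are expressed in different timescales, so they cannot be compared directly.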

mdhd atom

The mdhd atom is the media header.

stsd atom

The stsd atom is the sample table sample description atom. The contents of this atom depend on the trak type that contains it.

If a Quicktime file carries video that requires a palette, the palette information will be appended to this atom.

stco and co64 atoms

The stco atom is the sample table chunk offset atom. The co64 atom is the 64-bit chunk offset atom. The chunk offset table contains the absolute offsets of the media chunks in a QuickTime file's mdat atom.

A stco atom has the following structure:

bytes 0-3    atom size (including atom preamble)
bytes 4-7    atom type: 'stco'
bytes 8-11   QuickTime version information
bytes 12-15  number of chunk offsets
bytes 16..   chunk offset table

Each entry in the chunk offset table is a 32-bit absolute file offset that points to the beginning of a media chunk.

When filesystems began supporting file sizes larger than 4 gigabytes, the QuickTime file format needed larger numbers to represent offsets. The co64 atom type fills this need. A co64 atom has the following structure:

bytes 0-3    atom size (including atom preamble)
bytes 4-7    atom type: 'co64'
bytes 8-11   QuickTime version information
bytes 12-15  number of chunk offsets
bytes 16..   chunk offset table

In the co64 atom, each entry in the chunk offset table is a 64-bit absolute file offset that points to the beginning of a media chunk.
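Because the two atoms differ only in the width of each table entry, one routine can serve both. A minimal sketch, with the function name parse_chunk_offsets chosen for this example:

```python
import struct

def parse_chunk_offsets(atom):
    """Parse a whole 'stco' or 'co64' atom into a list of file offsets."""
    _size, atom_type, _version, count = struct.unpack_from(">I4sII", atom, 0)
    if atom_type == b"stco":
        fmt = ">%dI" % count  # 32-bit entries
    elif atom_type == b"co64":
        fmt = ">%dQ" % count  # 64-bit entries
    else:
        raise ValueError("not a chunk offset atom")
    return list(struct.unpack_from(fmt, atom, 16))
```

Returning plain Python integers in both cases means the rest of a decoder does not need to care which atom type the file used.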

stts atom

The stts atom is the sample table time-to-sample atom.

stss atom

The stss atom is the sample table sync sample atom. This atom contains a list of all samples that are sync samples. Sync samples are also known as keyframes or intra-coded frames. These samples indicate which video frames can be completely decoded on their own, without any information from other video frames.

A stss atom has the following structure:

bytes 0-3    atom size (including atom preamble)
bytes 4-7    atom type: 'stss'
bytes 8-11   QuickTime version information
bytes 12-15  number of sync samples
bytes 16..   sync sample table

Each entry in the sync sample table indicates the ID of a sample that is a sync sample. Note that this table begins numbering from 1.

As an example, if the stss atom of a video trak has 4 entries and those entries are 1, 9, 19, and 34, that means that video frames 1, 9, 19, and 34 (or 0, 8, 18, and 33 if your frames are numbered beginning at 0) are sync samples.

If a trak has no stss atom then all of the samples in the track are implicitly sync samples.
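The 1-based numbering and the "no stss atom" rule are easy to get wrong, so here is a minimal sketch of both. The names parse_stss and is_sync_sample are invented for this example, and None is used here to stand for a trak without an stss atom.

```python
import struct

def parse_stss(atom):
    """Parse a whole 'stss' atom into a set of 1-based sample numbers."""
    _size, atom_type, _version, count = struct.unpack_from(">I4sII", atom, 0)
    if atom_type != b"stss":
        raise ValueError("not an stss atom")
    return set(struct.unpack_from(">%dI" % count, atom, 16))

def is_sync_sample(sync_samples, index0):
    """Tell whether the sample with 0-based index index0 is a sync sample.

    Pass sync_samples=None when the trak has no stss atom.
    """
    if sync_samples is None:
        return True  # no stss atom: every sample is a sync sample
    return (index0 + 1) in sync_samples
```

With the example table of 1, 9, 19, and 34, is_sync_sample reports the 0-based frames 0, 8, 18, and 33 as sync samples.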

stsc atom

The stsc atom is the sample table sample-to-chunk atom.

stsz atom

The stsz atom is the sample table sample size atom. This atom contains the sizes of all of the samples in a trak. A stsz atom has the following structure:

bytes 0-3    atom size (including atom preamble)
bytes 4-7    atom type: 'stsz'
bytes 8-11   QuickTime version information
bytes 12-15  size of each sample
bytes 16-19  number of sample sizes
bytes 20..   sample size table

The stsz atom can operate in one of two modes. First, it is possible that all of the samples in a trak have the same size. In this case, the field at bytes 12-15 is set to the constant size, the field at bytes 16-19 is set to the total number of samples in the trak, and there is no sample size table starting at byte 20. This mode is commonly used in the stsz atom of audio traks. For example, an audio file that is 2 seconds long and has a sample rate of 22050 Hz will set the field at bytes 12-15 to 1, indicating that the size of each sample is 1. The field at bytes 16-19 will be set to (22050 samples/sec * 2 sec) = 44100 samples.

In the second mode, the samples may differ in size (logically, this mode would have to be used even if all of the samples were the same size except for one). In this case, the field at bytes 12-15 is set to 0. The field at bytes 16-19 contains the number of entries in the sample size table. Each entry in the sample size table contains the size of a sample in the trak.
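Both modes can be folded into a single routine that always returns one size per sample. A minimal sketch, with the function name parse_stsz chosen for this example:

```python
import struct

def parse_stsz(atom):
    """Parse a whole 'stsz' atom into a list with one size per sample."""
    _size, atom_type, _version, sample_size, count = struct.unpack_from(
        ">I4sIII", atom, 0)
    if atom_type != b"stsz":
        raise ValueError("not an stsz atom")
    if sample_size != 0:
        # First mode: constant size, no table follows.
        return [sample_size] * count
    # Second mode: one 32-bit table entry per sample, starting at byte 20.
    return list(struct.unpack_from(">%dI" % count, atom, 20))
```

Expanding the constant-size mode into a full list is convenient but wasteful for long audio traks; a real decoder might prefer to keep the constant and the count instead.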

Meta Data

A 'meta' atom contains atoms that carry human-readable text describing the file. The types of these atoms are 4 bytes long, as usual, but the first byte has the value 0xA9. The remaining 3 characters can be:

  • nam: Name of song or video
  • cpy: Copyright information
  • des: File description
  • cmt: General comment
  • alb: Album name
  • gen: ? Name of generating program?
  • ART: Artist name
  • too: ?
  • wrt: ?
  • day: ? Modification date?

Decompressing Compressed moov Atoms With zlib

The prospect of having to decode compressed moov atoms in Quicktime files seems to give many programmers pause. This need not be the case. When a compressed moov atom is detected, the free, open source zlib compression library can be called upon to do all the hard work.

In the abstract atom hierarchy, a compressed moov atom is laid out like this:

moov
  cmov
    dcom
    cmvd

On disk, a compressed moov atom will look like this:

bytes 0-3:   atom size (including 8-byte size and type preamble)
bytes 4-7:   atom type ('moov', movie header)
bytes 8-11:  atom size (including 8-byte size and type preamble)
bytes 12-15: atom type ('cmov', compressed movie header)
bytes 16-19: atom size (this should be 12 bytes)
bytes 20-23: atom type ('dcom', decompressor)
bytes 24-27: decompression library used (usually 'zlib')
bytes 28-31: atom size (including 8-byte size and type preamble)
bytes 32-35: atom type ('cmvd', compressed movie header data)
bytes 36-39: size of decompressed data
bytes 40-n:  compressed data

Note that this structure makes it theoretically possible to use other libraries to compress moov atoms, but zlib is most commonly used.

Here is a lazy algorithm for decompressing a compressed moov atom:

  1. check if bytes 12-15 contain 'cmov'; if yes:
  2. allocate a buffer for the decompressed moov atom, the size of which is specified by bytes 36-39
  3. initialize the zlib library, initialize a z_stream structure with pointers to the compressed and decompressed buffers, and all the other necessary variables
  4. call zlib to decompress the atom
  5. free the compressed moov atom, process the newly-decompressed moov atom (which will begin with a proper size and 'moov' type)
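The steps above can be sketched in Python, whose standard zlib module wraps the same library. This sketch assumes the entire moov atom is already in memory and follows the fixed byte offsets of the on-disk layout shown earlier; the function name decompress_moov is invented for this example.

```python
import struct
import zlib

def decompress_moov(moov):
    """Return an uncompressed moov atom given the bytes of a whole moov atom.

    An atom that is not compressed is returned unchanged.
    """
    if moov[12:16] != b"cmov":
        return moov  # step 1: not a compressed moov atom
    if moov[24:28] != b"zlib":
        raise ValueError("unsupported decompression library")
    cmvd_size = struct.unpack_from(">I", moov, 28)[0]
    expected = struct.unpack_from(">I", moov, 36)[0]
    # Compressed data runs from byte 40 to the end of the cmvd atom.
    compressed = moov[40:28 + cmvd_size]
    # zlib.decompress allocates the output buffer itself (steps 2 to 4).
    data = zlib.decompress(compressed)
    if len(data) != expected:
        raise ValueError("decompressed size does not match the cmvd header")
    return data
```

The returned bytes begin with a proper size and 'moov' type, so they can be handed to the same atom parser used for uncompressed files.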

As an aside, one might wonder about the rationale behind compressing moov atoms. The data inside QT files can reach gargantuan sizes, and the moov atom will be rather tiny in comparison. Why bother saving a few tens of kilobytes on the moov atom? One suggestion I have received is data integrity: a zlib stream carries a checksum of the uncompressed data (strictly speaking an Adler-32 checksum rather than a CRC). If an error occurs in the data stream while transmitting the compressed moov atom, a problem will be detected during decompression.
