Introduction to MPEG 2 Video Compression
The international standard ISO/IEC 13818-2, "Generic Coding of Moving Pictures and Associated Audio Information: Video", and ATSC document A/54, "Guide to the Use of the ATSC Digital Television Standard", describe a system, known as MPEG-2, for encoding and decoding digital video data. Digital video data is encoded as a series of code words whose average length is much smaller than it would be if, for example, each pixel in every frame were coded as an 8-bit value. This reduction is known as data compression. The standard allows for the encoding of video over a wide range of resolutions, including the higher resolutions commonly known as HDTV.
In this system, encoded pictures are made up of pixels. Each 8x8 array of pixels is known as a block. A 2x2 array of luminance blocks (16x16 pixels), together with its associated chrominance blocks, is termed a macroblock. Compression is achieved using the well-known techniques of prediction (motion estimation in the encoder, motion compensation in the decoder), a two-dimensional discrete cosine transform (DCT) performed on 8x8 blocks of pixels, quantization of the DCT coefficients, and Huffman and run/level coding. I pictures are encoded without prediction. P pictures may be encoded with prediction from previous pictures. B pictures may be encoded using prediction from both previous and subsequent pictures. A simplified MPEG-2 encoder and decoder are shown in the MPEG Coder/Decoder Diagram.
The encoding process for P and B pictures is explained as follows.
Data representing macroblocks of pixel values for a picture to be encoded are fed to both the subtractor and the motion estimator. The motion estimator compares each of these new macroblocks with macroblock-sized areas in a previously stored reference picture or pictures, and finds the area in the reference picture that most closely matches the new macroblock. The motion estimator then calculates a motion vector (mv), which represents the horizontal and vertical displacement from the macroblock being encoded to the matching macroblock-sized area in the reference picture. Note that motion vectors have 1/2-pixel resolution, achieved by linear interpolation between adjacent pixels.
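The matching step can be sketched as an exhaustive block search. This is only an illustration under stated assumptions: the standard does not prescribe how an encoder finds its motion vectors, and the sum of absolute differences (SAD) cost, the whole-pixel search, the search range, and the function names below are all choices made for this example.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block in `cur` at (cx, cy)
    and a same-sized area in `ref` at (rx, ry)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total


def find_motion_vector(cur, ref, cx, cy, size=16, search=4):
    """Full search over +/- `search` whole pixels (hypothetical range).

    Returns (dx, dy, cost), where (dx, dy) is the displacement from the
    macroblock being encoded to the best-matching area in `ref`.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

Pictures are plain lists of rows of luminance values here. A practical encoder would typically refine the best whole-pixel match to 1/2-pixel resolution and use a faster search strategy than this full scan.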
The motion estimator also reads this matching macroblock (known as a predicted macroblock) out of the reference picture memory and sends it to the subtractor, which subtracts it, on a pixel-by-pixel basis, from the new macroblock entering the encoder. This forms a prediction error, or residual, signal that represents the difference between the predicted macroblock and the actual macroblock being encoded. This residual is often very small.
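As a rough sketch of this step, the code below reads a predicted block at a 1/2-pixel position and subtracts it from the actual block. The function names and the list-of-rows picture layout are hypothetical. The rounding used here (add half the divisor before dividing) is the usual way of averaging adjacent pixels and is an assumption of this example; the block is assumed to lie inside the picture.

```python
def fetch_prediction(ref, x2, y2, size):
    """Read a size x size block whose top-left corner is at (x2/2, y2/2).

    x2 and y2 are coordinates in half-pixel units. At half-pixel positions
    the value is the rounded average of the 2 or 4 neighbouring pixels.
    """
    x0, y0 = x2 // 2, y2 // 2
    hx, hy = x2 % 2, y2 % 2  # 1 when the position lies between two pixels
    block = []
    for y in range(size):
        row = []
        for x in range(size):
            a = ref[y0 + y][x0 + x]
            b = ref[y0 + y][x0 + x + hx]
            c = ref[y0 + y + hy][x0 + x]
            d = ref[y0 + y + hy][x0 + x + hx]
            row.append((a + b + c + d + 2) // 4)
        block.append(row)
    return block


def subtract_prediction(actual, predicted):
    """Pixel-by-pixel residual: actual block minus predicted block."""
    return [[a - p for a, p in zip(arow, prow)]
            for arow, prow in zip(actual, predicted)]
```

When both half-pixel flags are zero, the four taps coincide and the function returns the stored pixels unchanged.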
The residual is transformed from the spatial domain to the frequency domain by a two-dimensional DCT. (The two-dimensional DCT consists of separable vertical and horizontal one-dimensional DCTs.) The DCT coefficients of the residual are then quantized in a process that reduces the number of bits needed to represent each coefficient. Usually many coefficients are effectively quantized to 0.
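The separable transform and the effect of quantization can be illustrated as follows. This is a floating-point sketch, not the arithmetic of a real encoder. The single uniform step size in `quantize` is a simplifying assumption (MPEG-2 quantization also involves per-coefficient weighting matrices and a scale factor), and all names are made up for the example.

```python
import math

N = 8  # block size


def dct_1d(v):
    """Orthonormal one-dimensional DCT (type II) of an N-point sequence."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * sum(
            v[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for n in range(N)))
    return out


def dct_2d(block):
    """Two-dimensional DCT as a horizontal pass followed by a vertical pass.

    Returns coeffs[v][u], with v the vertical and u the horizontal frequency.
    """
    rows = [dct_1d(r) for r in block]                       # horizontal DCTs
    cols = [dct_1d([rows[y][x] for y in range(N)])          # vertical DCTs
            for x in range(N)]
    return [[cols[u][v] for u in range(N)] for v in range(N)]


def quantize(coeffs, step):
    """Simplified uniform quantizer: divide by `step`, round to nearest."""
    return [[round(c / step) for c in row] for row in coeffs]
```

For a flat block in which every pixel is 100, only the coefficient at `[0][0]` is nonzero (800 with this scaling), so `quantize(dct_2d(block), 16)` yields 50 in that position and 0 everywhere else.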
The quantized DCT coefficients are Huffman and run/level coded, which further reduces the average number of bits per coefficient. The coded coefficients are combined with motion vector data and other side information (including an indication of whether the picture is an I, P or B picture) and sent to the decoder.
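The run/level idea can be sketched as below. Two details here are not stated in the text above and are assumptions of this example: the coefficients are visited in the conventional zigzag order (low frequencies first), and every coefficient, including the first, is treated the same way. The Huffman tables that map each (run, level) pair to a code word are not reproduced, and the function names are hypothetical.

```python
def zigzag_order(n=8):
    """(row, column) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):  # s indexes the anti-diagonals
        diag = [(y, s - y) for y in range(n) if 0 <= s - y < n]
        # Alternate the direction of travel on successive diagonals.
        order.extend(diag if s % 2 else reversed(diag))
    return order


def run_level_pairs(block):
    """Convert quantized coefficients to (run, level) pairs.

    `run` is the number of zero coefficients preceding each nonzero
    `level`. Trailing zeros produce no pair; an end-of-block code would
    stand in for them.
    """
    pairs, run = [], 0
    for y, x in zigzag_order(len(block)):
        level = block[y][x]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs
```

Because most quantized coefficients are 0, a 64-coefficient block often reduces to a handful of pairs, and short code words can be assigned to the most common pairs.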
For P pictures, the quantized DCT coefficients also go to an internal loop that replicates the operation of the decoder (a decoder within the encoder). The residual is inverse quantized and inverse DCT transformed. The predicted macroblock read out of the reference picture memory is added back to the residual on a pixel-by-pixel basis, and the result is stored back into memory to serve as a reference for predicting subsequent pictures. The goal is to have the data in the reference picture memory of the encoder match the data in the reference picture memory of the decoder. B pictures are not stored as reference pictures.
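A minimal sketch of this reconstruction loop for one 8x8 block follows. It assumes a single uniform quantizer step and an orthonormal floating-point inverse DCT, and it clips the result to the 8-bit range 0 to 255. The names are hypothetical, and a real codec specifies this arithmetic much more precisely so that encoder and decoder stay in step.

```python
import math

N = 8  # block size


def idct_1d(c):
    """Orthonormal one-dimensional inverse DCT of N coefficients."""
    out = []
    for n in range(N):
        total = 0.0
        for k in range(N):
            scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
            total += scale * c[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
        out.append(total)
    return out


def idct_2d(coeffs):
    """Separable inverse DCT: vertical pass, then horizontal pass."""
    cols = [idct_1d([coeffs[v][u] for v in range(N)]) for u in range(N)]
    return [idct_1d([cols[u][y] for u in range(N)]) for y in range(N)]


def reconstruct_block(levels, step, predicted):
    """Inverse quantize, inverse transform, add the prediction, and clip."""
    coeffs = [[level * step for level in row] for row in levels]
    residual = idct_2d(coeffs)
    return [[min(255, max(0, p + round(r))) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(predicted, residual)]
```

Since the loop starts from the quantized values rather than the original residual, the block written to the reference memory includes the quantization error, exactly as the decoder will see it.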
The encoding of I pictures uses the same circuit; however, no motion estimation occurs and the (-) input to the subtractor is forced to 0. In this case the quantized DCT coefficients represent transformed pixel values rather than residual values, as was the case for P and B pictures. As with P pictures, decoded I pictures are stored as reference pictures.
The system can encode sequences of progressive or interlaced pictures. For interlaced sequences, pictures may be encoded as field pictures or as frame pictures. For progressive sequences, all pictures are frame pictures with frame DCT coding and frame prediction. Further explanation is found in Field DCT Coding and Frame DCT Coding.
The decoding process can be thought of as the reverse of the encoding process. Refer to the MPEG Coder/Decoder Diagram.
The received encoded data is Huffman and run/level decoded. Motion vectors are parsed from the data stream and fed to the motion compensator. Quantized DCT coefficients are fed to the inverse quantizer and then to an IDCT circuit that transforms them back to the spatial domain. For P and B pictures, the motion compensator translates the motion vector data to a memory address in order to read a particular macroblock (the predicted macroblock) out of a previously stored reference picture. The adder adds this prediction to the residual to form reconstructed picture data. For I pictures, there are no motion vectors and no reference picture, so the prediction is forced to zero. For I and P pictures, the adder output is fed back to be stored as a reference picture for future predictions.
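The translation from motion vector to memory address might look like the sketch below. It assumes a hypothetical reference memory that holds the luminance samples in raster order, one sample per address, and a motion vector expressed in 1/2-pixel units. The function name and parameters are illustrative only; real decoders lay out their picture memory in many different ways.

```python
def prediction_address(mb_col, mb_row, mv_x2, mv_y2, width, mb_size=16):
    """Locate the predicted macroblock in a raster-order reference picture.

    mb_col, mb_row : position of the macroblock being decoded, in macroblocks
    mv_x2, mv_y2   : motion vector components in half-pixel units
    width          : picture width in pixels

    Returns (offset, half_x, half_y): the address of the top-left whole
    pixel, plus flags saying whether horizontal or vertical interpolation
    between adjacent pixels is needed.
    """
    x2 = mb_col * mb_size * 2 + mv_x2  # absolute position, half-pixel units
    y2 = mb_row * mb_size * 2 + mv_y2
    x, half_x = divmod(x2, 2)
    y, half_y = divmod(y2, 2)
    return y * width + x, half_x, half_y
```

For example, with a 720-pixel-wide picture, the macroblock in column 2 and row 1 with a motion vector of (5, -3) half pixels maps to whole-pixel position (34, 14), address 14 * 720 + 34 = 10114, with both interpolation flags set.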
Introduction to MPEG 2 Video Compression (this page)
MPEG Coder/Decoder Diagram
Profiles and Levels
Frames, Fields, Pictures (I, P, B)
I P B Picture Reordering
MPEG 2 Video Data Structures