MP4 is a container format which carries video encoded with the MPEG standards. This page uses ffmpeg.exe to determine the basic coding details of different MP4 files. The command used is:
ffmpeg -i file.mp4
Motion video contains massive amounts of redundant information. Motion video image compression relies on two facts:
- each image contains spatially redundant information (neighbouring pixels tend to be similar); and
- there are very few changes from one image to the next (temporal redundancy).
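As a rough illustration of the second fact, the following Python sketch (using synthetic frames; the function name is purely illustrative) measures how small a fraction of pixels actually changes between two consecutive frames:

import numpy as np

def changed_fraction(prev_frame, curr_frame, threshold=8):
    # Fraction of pixels whose luminance changed by more than
    # `threshold` between two consecutive greyscale frames.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return np.count_nonzero(diff > threshold) / diff.size

# Two synthetic 352x288 frames; the second differs only in a small region.
prev_frame = np.full((288, 352), 128, dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[100:120, 200:240] += 50               # a small "moving object"

print(changed_fraction(prev_frame, curr_frame))  # ~0.008, i.e. under 1%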
As with JPEG, the Motion Picture Experts Group (MPEG) was set up to develop an international open standard for the compression of high-quality audio and video information. At the time, CD-ROM single-speed technology allowed a maximum bit rate of 1.2 Mbps and this was the rate that the standard was built around. These days, 12x and 20x CD-ROM bit rates are common, which allow for smoother and faster animations.
MPEG’s main aim was to provide good quality video and audio using hardware processors (and in some cases, on workstations with sufficient computing power, to perform the tasks using software). Figure 1 shows the main processing steps of encoding:
Figure 1 MPEG encoding with block matching
MPEG-1 typically uses the CIF format for its input, which has the following parameters:
- a 352x288 pixel luminance frame;
- 176x144 pixels for each of the two chrominance components;
- non-interlaced (progressive) scanning.
This gives a picture quality which is similar to VCR technology. MPEG-1 differs from conventional TV in that it is non-interlaced (known as progressive scanning), but the frame rate is the same as conventional TV, i.e. 25fps (for PAL and SECAM) and 30fps (for NTSC). Note that MPEG-1 can also use larger pixel frames, such as CCIR-601 720x480, but the CIF format is the most frequently used.
Taking into account the interlacing effect, the CIF format is actually derived from the CCIR-601 format. The CCIR-601 digital television standard defines a field size of 720x243 (or 720x240) pixels at 60 fields per second. Note that a frame actually comprises two fields, where the odd and even lines are interlaced to create the full picture. When the interlaced luminance information occupies the full 720x480 frame, the chrominance components are reduced by 4:2:2 subsampling to give 360x243 (or 360x240) pixels at 60 fields per second.
MPEG-1 also reduces the chrominance components by reducing the pixel data by half in the vertical, horizontal and time directions. It also reduces the image size so that the number of pixels in each direction is divisible by 8 or 16. This is because the motion analysis and DCT conversion operate on 16x16 or 8x8 pixel blocks. As a result, the number of lines changes for an MPEG-1 encoded movie between the NTSC standard and the PAL and SECAM standards. The final figure for PAL and SECAM is 288 lines at 50 fields per second; for NTSC it is 240 lines at 60 fields per second. These require the same number of bits to encode the streams, since 288 x 50 = 240 x 60 = 14400 lines per second.
The MPEG encoded bitstream comprises three components: compressed video, compressed audio and system-level information. To provide easier synchronization and lip synching, the audio and video streams are time-stamped using a 90 kHz reference clock.
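As a simple illustration, the following Python sketch (the helper names are hypothetical, not part of any MPEG library) converts between presentation times in seconds and 90 kHz clock ticks:

# Hypothetical helpers: convert a presentation time in seconds to/from
# MPEG 90 kHz system clock ticks, as used to time-stamp both streams.
MPEG_SYSTEM_CLOCK_HZ = 90_000

def seconds_to_pts(seconds):
    # Presentation time stamp expressed in 90 kHz ticks.
    return round(seconds * MPEG_SYSTEM_CLOCK_HZ)

def pts_to_seconds(pts):
    return pts / MPEG_SYSTEM_CLOCK_HZ

# One 25 fps video frame lasts 1/25 s = 3600 ticks; audio samples stamped
# against the same clock stay lip-synced with the video on playback.
print(seconds_to_pts(1 / 25))   # 3600
print(seconds_to_pts(1 / 30))   # 3000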
The first stage of MPEG encoding is to convert a video image into the correct color space format. In most cases, the incoming data is in 24-bit RGB color format and is converted into 4:2:2 YCrCb (or YUV) form. Some information will obviously be lost in the conversion of the color components, as there will only be half the number of samples for the redness (Cr) and the blueness (Cb) as there are for the luminance (Y), but it results in some compression.
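The sketch below shows one way this conversion could be done in Python with NumPy, assuming the ITU-R BT.601 matrix; real encoders may use slightly different scalings and offsets, so treat it as illustrative only:

import numpy as np

def rgb_to_ycrcb_422(rgb):
    # Convert an HxWx3 uint8 RGB image to a full-resolution luminance
    # plane plus 4:2:2-subsampled chrominance planes (BT.601 matrix).
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 0.713 * (r - y) + 128                  # "redness"
    cb = 0.564 * (b - y) + 128                  # "blueness"
    # 4:2:2 subsampling: keep full vertical resolution but halve the
    # horizontal chrominance resolution by averaging pairs of samples.
    cr = (cr[:, 0::2] + cr[:, 1::2]) / 2
    cb = (cb[:, 0::2] + cb[:, 1::2]) / 2
    return y, cr, cb

image = np.random.randint(0, 256, (288, 352, 3), dtype=np.uint8)
y, cr, cb = rgb_to_ycrcb_422(image)
print(y.shape, cr.shape, cb.shape)   # (288, 352) (288, 176) (288, 176)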
MPEG compression tries to detect movement within a frame. This is done by subdividing a frame into slices and then subdividing each slice into a number of macroblocks. For example, a PAL format which has:
352x288 pixel frame (101376 pixels)
can, when divided into 16x16 blocks, give a whole number of 396 macroblocks. Dividing 288 by 16 gives a whole number of 18 slices, and dividing 352 by 16 gives 22. Thus the image is split into 22 macroblocks in the x-direction and 18 in the y-direction, as illustrated in Figure 2.
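This arithmetic can be checked directly with a few lines of Python:

# The macroblock arithmetic from the text: a 352x288 PAL frame divided
# into 16x16 luminance macroblocks.
WIDTH, HEIGHT, MB = 352, 288, 16

mb_across = WIDTH // MB       # 22 macroblocks in the x-direction
slices = HEIGHT // MB         # 18 slices in the y-direction
print(mb_across, slices, mb_across * slices)   # 22 18 396
assert WIDTH % MB == 0 and HEIGHT % MB == 0    # divides into whole blocks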
Luminance (Y) values use a 16x16 pixel macroblock, whereas the two chrominance components have 8x8 pixel macroblocks. Note that only the luminance component is used for the motion calculations, as motion detected in the luminance component is likely to match the motion in the redness and blueness components.
Figure 2 Segmentation of an image into subblocks
MPEG uses a motion estimation algorithm to search for multiple blocks of pixels within a given search area and tries to track objects which move across the image. Each luminance (Y) 16x16 macroblock is compared with other macroblocks within either a previous or future frame to find a close match. When a close match is found, a vector is used to describe where the block is to be located, along with any difference information from the compared block. As there tend to be very few changes from one frame to the next, this is far more efficient than using the original data.
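A minimal sketch of this kind of exhaustive ("full-search") block matching, using the sum of absolute differences (SAD) as the matching criterion, might look as follows in Python; the function names are illustrative, not taken from any particular codec:

import numpy as np

def sad(a, b):
    # Sum of absolute differences between two equal-sized pixel blocks.
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def best_match(ref_frame, block, top, left, search=7):
    # Exhaustive search of a +/- `search` pixel window in the reference
    # frame for the macroblock located at (top, left) in the current
    # frame. Returns the motion vector (dy, dx) and the residual block.
    h, w = block.shape
    best, best_cost = (0, 0), sad(ref_frame[top:top + h, left:left + w], block)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= ref_frame.shape[0] - h and 0 <= x <= ref_frame.shape[1] - w:
                cost = sad(ref_frame[y:y + h, x:x + w], block)
                if cost < best_cost:
                    best, best_cost = (dy, dx), cost
    dy, dx = best
    residual = block.astype(np.int16) - ref_frame[top + dy:top + dy + h,
                                                  left + dx:left + dx + w]
    return best, residual

When a good match is found, only the short motion vector and the (mostly zero) residual block need to be coded, rather than the full pixel data.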
Figure 3 shows two consecutive images of 2D luminance made up into 16x5 macroblocks. Each of these blocks has 16x16 pixels. It can be seen that, in this example, there are very few differences between the two images. If the previous image is transmitted in its entirety then the current image can be transmitted with reference to the previous image. For example, the macroblocks at (0,0), (0,1) and (0,2) in the current image are the same as in the previous image, so they can be coded simply with a reference to the previous image. The (0,3) macroblock differs from the previous image, but it is identical to the (0,2) block of the previous image, thus a reference to this block is made. This can continue, as most of the blocks in the image are identical to the previous image. The only other differences in the current image are at (4,0) and (4,1); these blocks can be stored in their entirety or specified by their differences from a previous similar block.

A major objective of the MPEG encoder is to spend a much greater time compressing the video information into its most efficient form. Each macroblock is compared mathematically with other blocks in a previous frame, or even in a future frame. The offset to another block can cross a macroblock boundary or even fall between pixel boundaries. This comparison repeats until a match is found or the specified search area within the frame has been exhausted. If no match is available, the search process can be repeated using a different frame, or the macroblock can be stored as a complete set of data. As previously stated, if a match is found, the vector information specifying where the matching macroblock is located is used along with any difference information.
Figure 3 Two consecutive images
As the technique involves many searches over a wide search area and there are many frames to be encoded, the encoding must normally be performed on a high-powered workstation. This has several implications:
With the motion estimation completed, the raw data describing the frame is now converted by the DCT algorithm in preparation for Huffman coding.
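For illustration, the following Python sketch computes the orthonormal 2-D DCT of an 8x8 block directly from its definition. A flat block ends up with all of its energy in the single DC coefficient, which is what makes the subsequent coding so efficient:

import numpy as np

def dct_2d(block):
    # Orthonormal two-dimensional DCT-II of an NxN pixel block, built
    # directly from the definition: C[u, x] = a(u) cos((2x+1)u*pi/2N).
    n = block.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    scale = np.full(n, np.sqrt(2 / n))
    scale[0] = np.sqrt(1 / n)
    c = scale[:, None] * basis
    return c @ block @ c.T                 # transform rows, then columns

block = np.full((8, 8), 100.0)             # a completely flat 8x8 block...
coeffs = dct_2d(block)
print(round(coeffs[0, 0]))                 # ...has all its energy at DC: 800
coeffs[0, 0] = 0
print(np.abs(coeffs).max() < 1e-9)         # every AC coefficient vanishes: True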
As video frames tend not to change much between frames, MPEG video compression uses either full frames (which contain all the frame data) or partial frames (which refer back to other frames). The three frame types are defined as:
- I-frames (intra-coded frames). Complete, self-contained frames which are coded without reference to any other frame.
- P-frames (predictive frames). Coded as motion-compensated differences from the preceding I- or P-frame.
- B-frames (bidirectional frames). Coded using differences from both a previous and a future frame.
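As an illustration of how these types are typically interleaved, the following hypothetical Python sketch assigns a type to each frame from its position in a group of pictures (GOP), assuming the common IBBPBB... pattern; actual encoders choose their patterns adaptively:

def frame_type(index, gop_size=12, b_frames=2):
    # Type of the frame at `index`, based on its position in the GOP.
    pos = index % gop_size
    if pos == 0:
        return "I"                                # full, self-contained frame
    return "P" if pos % (b_frames + 1) == 0 else "B"

print("".join(frame_type(i) for i in range(12)))  # IBBPBBPBBPBB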