MP4Player internals

December, 2002
Bill May
Cisco Systems

Session
Media
Bytestream
Decode Plugin
Media Data Flow
Threading
Rendering and Synchronization
Finally

This document describes the internal workings of mp4player for those who are interested.

mp4player was intended as an MPEG-4 streaming-only player, but since we were going to handle multiple audio codecs, I made an early decision to support both local and streaming playback, as well as multiple audio and video codecs.

mp4player (and its derivative applications gmp4player and wmp4player) is broken up into two parts: libmp4player, which contains everything needed for playback, and the playback control code (for example, in gmp4player, this is the GTK GUI code).

There are a few major concepts that need to be understood: the session, the media, the bytestream, and the decode plugin.

Session

A session is something that the user wants to play. This could be audio only, video only, both, multiple videos with audio, whatever. A session is represented by the CPlayerSession class, which is the major structure/API of libmp4player. The CPlayerSession class is also responsible for the synchronization of the audio and the other timed rendering classes.
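
As a rough sketch of the relationship (CPlayerSession and CPlayerMedia are real class names, but the members below are invented for illustration and are not the actual libmp4player API), a session owns one media object per stream and drives playback of all of them together:

    // Illustrative sketch of the session/media relationship. The
    // members shown here are invented stand-ins, not the actual
    // libmp4player API.
    #include <vector>

    class CPlayerMedia;  // one instance per audio/video/text stream

    class CPlayerSession {
    public:
      void add_media(CPlayerMedia *m) { m_media.push_back(m); }
      void play(void)  { /* start decode threads and the sync thread */ }
      void pause(void) { /* pause all media */ }
    private:
      std::vector<CPlayerMedia *> m_media;  // everything this session plays
    };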

Media

Each media stream is represented by a CPlayerMedia class. This class is responsible for decoding data from the bytestream and passing it to a sync class for buffering and rendering. The media class really doesn't care too much whether it is audio, video or text - the majority of the code works for all of them.

Bytestream

A bytestream is the mechanism that gives the Media a frame (or a number of bytes) to decode, as well as a timestamp for that frame. For example, an mp4 file bytestream would read each audio or video frame from an mp4 container file.

mp4player supports bytestreams for avi files, mp4 files, mpeg files, .mov files, some raw files, and RTP.

Some bytestreams need to be media aware (meaning they have to know something about the structure of the media inside of them). The mpeg file and some of the RTP bytestreams are media aware; the mp4 container file is not.

Each media will have its own bytestream. The bytestream base class resides in our_bytestream.h.
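
As a sketch of the idea (the real base class is in our_bytestream.h; the class and signatures below are invented for illustration), a bytestream boils down to "give me the next frame and its timestamp":

    // Invented sketch of a bytestream interface; the real base class
    // in our_bytestream.h differs.
    #include <cstdint>

    class CByteStream {
    public:
      virtual ~CByteStream() {}
      // Fill in a pointer to the next encoded frame, its length in
      // bytes, and the timestamp (msec) at which it should render.
      virtual bool get_next_frame(const uint8_t **frame,
                                  uint32_t *frame_len,
                                  uint64_t *timestamp_msec) = 0;
      // True once the end of the file or stream has been reached.
      virtual bool eof(void) const = 0;
    };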

Decode Plugin

Each media must have a way to translate the encoded data into data that can be rendered (for video, this is YUV data; for audio, it is PCM - if you need to know what YUV or PCM is, do a web search).

With this in mind, we have the concept of a decode plugin that the media can use to take data from the bytestream and pass it to the sync routines. At startup, decode plugins are detected (rather than hard-linked at compile time), and the proper decode plugin is selected as each media is created.

The decode plugin takes the encoded frame from the bytestream, decodes it and passes the raw data to the rendering buffer.
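
A sketch of the plugin idea (invented names; the real plugin interface differs): each plugin advertises which codecs it can handle, so the player can select one as each media is created, and exposes a per-frame decode entry point:

    // Invented sketch of a decode plugin's function table; the real
    // mp4player plugin interface differs.
    #include <cstdint>

    struct decode_plugin_sketch_t {
      // Nonzero if this plugin can handle the named codec; used when
      // selecting a plugin as each media is created.
      int (*can_decode)(const char *codec_name);
      // Create per-stream decoder state from codec configuration.
      void *(*create)(const uint8_t *config, uint32_t config_len);
      // Decode one encoded frame; the plugin pushes the resulting raw
      // YUV/PCM data into the rendering (sync) buffers itself.
      int (*decode_frame)(void *state, const uint8_t *frame,
                          uint32_t frame_len, uint64_t timestamp_msec);
      // Tear down the per-stream state.
      void (*destroy)(void *state);
    };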

Media Data Flow

The media data flow is fairly simple:

bytestream -> decoder -> render buffer -> rendering
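
In code, using the invented types from the sketches above, the loop each media runs amounts to:

    // Simplified decode loop using the invented types sketched above:
    // pull a frame from the bytestream and hand it to the decode
    // plugin, which deposits raw data into the render buffer.
    void decode_loop(CByteStream *bs, decode_plugin_sketch_t *plugin,
                     void *decoder_state)
    {
      const uint8_t *frame;
      uint32_t frame_len;
      uint64_t timestamp;

      while (!bs->eof()) {
        if (!bs->get_next_frame(&frame, &frame_len, &timestamp))
          break;
        // The decoded YUV/PCM lands in a sync ring buffer; the sync
        // thread renders it at (approximately) 'timestamp'.
        plugin->decode_frame(decoder_state, frame, frame_len, timestamp);
      }
    }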

Threading

libmp4player uses multiple threads. Whatever control application is used will have its own thread (or multiple threads).

Each media has a thread for decoding and, if it uses RTP as a bytestream, a thread for RTP packet reception (except for RTP over RTSP, which has a single thread for all media).
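
For example (a sketch only; the real code wraps thread creation in its own classes), a per-media decode thread could be started with SDL 1.2's thread API:

    // Sketch of starting a per-media decode thread with SDL 1.2's
    // thread API; the real player wraps this differently.
    #include <SDL.h>
    #include <SDL_thread.h>

    static int media_decode_thread(void *arg)
    {
      // 'arg' would be the CPlayerMedia instance; run its decode loop
      // until the session stops or the bytestream ends.
      return 0;
    }

    SDL_Thread *start_decode_thread(void *media)
    {
      return SDL_CreateThread(media_decode_thread, media);
    }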

SDL will start a thread for audio rendering.

Finally, there is a thread for synchronization of audio and video.

Rendering and Synchronization

We use the open source package SDL for rendering - it is a multi-platform rendering engine for games. We have modified it slightly to obtain the audio latency for better synchronization. (Actually, we use the standard SDL routines, and copy the audio rendering code into our own library, to which we've added our audio latency code.)

The decode plugins have an interface to a sync class. There are three interfaces: to the audio, video and text classes. There is a common base sync class (CSync) that all sync classes derive from.

Audio has a CAudioSync class, which contains the basic APIs called from the plugin. Audio now has an intermediate base class that we use, located in audio_buffer.cpp and audio_buffer.h. This class creates a ring buffer from the decoded audio. It handles callbacks, making hardware-specific calls to start/stop/pause. It does quite a bit more than that: it verifies that the audio is being received correctly, and will add or subtract samples to compensate for output frequency clock drift.
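
A rough sketch of the add/subtract idea (invented code, not what audio_buffer.cpp actually does): while copying a block of samples out of the ring buffer, occasionally repeat or drop one sample to track the drift between the output clock and the stream timestamps.

    // Invented sketch of clock-drift compensation, not the code in
    // audio_buffer.cpp. Copies one block of 16-bit samples, repeating
    // the last sample if the output clock runs fast (drift > 0) or
    // dropping it if the clock runs slow (drift < 0).
    #include <cstdint>
    #include <cstring>

    // 'out' must have room for count + 1 samples; returns the number
    // of samples actually written.
    static uint32_t copy_with_drift(int16_t *out, const int16_t *in,
                                    uint32_t count, int drift)
    {
      if (drift > 0) {
        // Stretch the block: repeat the last sample once.
        memcpy(out, in, count * sizeof(int16_t));
        out[count] = in[count - 1];
        return count + 1;
      }
      if (drift < 0) {
        // Shrink the block: drop the last sample.
        memcpy(out, in, (count - 1) * sizeof(int16_t));
        return count - 1;
      }
      memcpy(out, in, count * sizeof(int16_t));
      return count;
    }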

Text and video share a common class (CTimedSync), which is called from the sync task. CTimedSync (sync.h and timed_sync.cpp) contains the ring buffer information, the interface from the sync class, and some statistics.

CTimedTextSync is derived from CTimedSync and interfaces with a CTextRenderer class. It determines which derived class of CTextRenderer to use when it receives a configure call from the plugin.

CVideoSync is a bit more complex, deriving from both CTimedSync and CVideoApi. CVideoApi lists the APIs called from the plugins, along with some special APIs used by the sync task. CSDLVideoSync derives from CVideoSync.
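
In outline, the hierarchy described above looks like this (skeleton only, all members omitted):

    // Skeleton of the sync class hierarchy described above.
    class CSync {};                              // common base (sync.h)
    class CAudioSync : public CSync {};          // audio APIs from the plugin
    // (an intermediate audio ring-buffer base class from audio_buffer.h
    // sits under CAudioSync; its name is omitted here)
    class CTimedSync : public CSync {};          // shared by video and text
    class CTimedTextSync : public CTimedSync {}; // drives a CTextRenderer
    class CVideoApi {};                          // plugin and sync-task APIs
    class CVideoSync : public CTimedSync,
                       public CVideoApi {};
    class CSDLVideoSync : public CVideoSync {};  // SDL-based rendering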

If one wanted, these functions could be replaced with other rendering engines, as we did for audio on Mac OS X; look at audio_macosx.cpp for an example.

Rendering is done through the CPlayerSession sync thread. This is a fairly complex state machine when both audio and other timed media are being displayed. It works by starting the audio rendering, then using the audio latency to feed back the display time to the sync task for the other rendering. The audio buffer driver uses a callback when it requires more data.

We synchronize the "display" times passed by the bytestream with the time of day from the gettimeofday function.
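
A simplified version of that mapping (invented names; the real computation lives in the sync task): take the wall-clock time at which audio started, add the measured audio output latency, and schedule each frame at that base time plus its stream timestamp.

    // Invented sketch of mapping stream timestamps to wall-clock
    // display times; the real computation is done by the sync task.
    #include <sys/time.h>
    #include <cstddef>
    #include <cstdint>

    static uint64_t now_msec(void)
    {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return (uint64_t)tv.tv_sec * 1000 + tv.tv_usec / 1000;
    }

    // Wall-clock msec at which the frame stamped 'ts_msec' should
    // appear, given when audio started and the audio output latency
    // reported by the modified SDL audio layer.
    static uint64_t display_time_msec(uint64_t audio_start_msec,
                                      uint64_t audio_latency_msec,
                                      uint64_t ts_msec)
    {
      return audio_start_msec + audio_latency_msec + ts_msec;
    }

    // A frame is due for rendering once now_msec() reaches its
    // display time.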

Finally

To see the call flow to start and stop a session, you should look at main.cpp, function start_session().
