
Audio Recording

There's more you can do with audio than just hit play. This tutorial covers some of the basics of audio programming: recording and playback.

Also, make sure you're using the latest version of SDL. I had to upgrade to SDL 2.0.8 to get audio recording to work properly.

Audio data

To understand how audio recording works, it helps to understand how audio data works.

This is a song (with too much gain):

Now, let's zoom in on it:

As you can see, sound is a wave. A wave can be represented by a sequence of values (or in the case of this stereo song, two sequences with one for each sound wave). Playing audio is just sending a sequence of values to the audio driver and recording audio is copying a sequence of values from the audio driver.
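
To make the "sequence of values" idea concrete, here is a minimal sketch that builds one second of a sine tone as a list of float samples. The function name makeSineWave and its parameters are made up for illustration and are not part of the tutorial source. It produces a single (mono) sequence, so a stereo sound would need two of them.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not in the tutorial source): builds a mono sine wave
// as a sequence of float samples in the range [-1, 1].
std::vector<float> makeSineWave( double toneHz, int sampleRate, double seconds )
{
    const double twoPi = 6.283185307179586;
    const std::size_t sampleCount = static_cast<std::size_t>( sampleRate * seconds );

    std::vector<float> samples( sampleCount );
    for ( std::size_t i = 0; i < sampleCount; ++i )
    {
        // Time of sample i in seconds
        const double t = static_cast<double>( i ) / sampleRate;

        // Height of the wave at that time
        samples[ i ] = static_cast<float>( std::sin( twoPi * toneHz * t ) );
    }

    return samples;
}
```

Calling makeSineWave( 440.0, 44100, 1.0 ) would give 44 100 values, one for each sample in a second of audio at 44.1 kHz.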

Here are a couple constants we'll be using.

First there's a constant defining that we'll only support up to 10 recording devices to choose from (we only need one for this program to work). Then we have the maximum time we allow for recording and the maximum time we can store in the buffer. We'll be recording for 5 seconds, but we allow for 6 seconds of recording for the sake of padding in case the application records 5.1 seconds or so.

Lastly we have a set of enumerations for the different states in the program. First the user selects a recording device. After a device is selected, the program sits in a stopped state waiting for the user to start recording. Then the user starts the recording, which runs for 5 seconds. After the recording has finished, the user can either start playback or record again. If anything results in an error, the program goes into an error state.

Typically in these tutorials we go down the source code explaining what everything means along the way, but in this tutorial we are going to jump around more, following the flow of execution. It makes things easier to understand than just going down the source code. So don't get lost as we jump around in the source file.

//  Maximum number of supported recording devices
const int MAX_RECORDING_DEVICES = 10;

//  Maximum recording time
const int MAX_RECORDING_SECONDS = 5;

//  Maximum recording time plus padding
const int RECORDING_BUFFER_SECONDS = MAX_RECORDING_SECONDS + 1;

//  The various recording actions we can take
enum RecordingState
{
    SELECTING_DEVICE,
    STOPPED,
    RECORDING,
    RECORDED,
    PLAYBACK,
    ERROR
};
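
The way the program moves between these states can be sketched as a small transition function. This is only an illustration: the tutorial handles the transitions inline in the main loop, and the State and Event names below are hypothetical stand-ins (State mirrors RecordingState, while Event does not exist in the tutorial source).

```cpp
// Hypothetical names: State mirrors the tutorial's RecordingState, and Event
// is an illustrative enum that the tutorial itself does not define.
enum class State { SelectingDevice, Stopped, Recording, Recorded, Playback, Error };
enum class Event { DevicesOpened, DeviceFailed, PressedOne, PressedTwo, BufferFinished };

State nextState( State current, Event event )
{
    switch ( current )
    {
        case State::SelectingDevice:
            if ( event == Event::DevicesOpened ) return State::Stopped;
            if ( event == Event::DeviceFailed )  return State::Error;
            break;

        case State::Stopped:
            if ( event == Event::PressedOne ) return State::Recording;
            break;

        case State::Recording:
            if ( event == Event::BufferFinished ) return State::Recorded;
            break;

        case State::Recorded:
            if ( event == Event::PressedOne ) return State::Playback;
            if ( event == Event::PressedTwo ) return State::Recording;
            break;

        case State::Playback:
            if ( event == Event::BufferFinished ) return State::Recorded;
            break;

        case State::Error:
            break;
    }

    // Any other combination leaves the state unchanged
    return current;
}
```

Note that the error state has no way out, which matches the description above: once something fails, the program stays there.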

These two callbacks are going to be doing the actual recording to and playing from our audio buffer. We'll get into details on how these work later.

//  Recording/playback callbacks
void audioRecordingCallback ( void* userdata, Uint8* stream, int len );
void audioPlaybackCallback  ( void* userdata, Uint8* stream, int len );

Here we have some textures. One to prompt the user to let them know what's going on and another array of textures to store the names of the recording devices. We also have an integer to keep track of how many devices we have available.

We also have two SDL_AudioSpec variables. An SDL_AudioSpec is an audio specification which basically defines how audio is recorded or played back. When we open an audio device for recording or playing, we request a specification but we may not get what we requested back because the audio driver does not support it. This is why we are going to store the specification we get back from the driver for recording and playback.

//  Prompt texture
LTexture gPromptTexture;

//  The text textures that specify recording device names
LTexture gDeviceTextures[ MAX_RECORDING_DEVICES ];

//  Number of available devices
int gRecordingDeviceCount = 0;

//  Received audio spec
SDL_AudioSpec gReceivedRecordingSpec;
SDL_AudioSpec gReceivedPlaybackSpec;

The "gRecordingBuffer" is a buffer of unsigned bytes that'll store our audio data. "gBufferByteSize" will store how many bytes the buffer will hold. "gBufferBytePosition" controls where we are in the buffer during recording or playback. "gBufferByteMaxPosition" controls the maximum piece of the buffer we will be using.

If that's confusing, remember that "gBufferByteSize" is 6 seconds of bytes (5 seconds + 1 second of padding) and "gBufferByteMaxPosition" is 5 seconds of bytes we'll be using.

//  Recording data buffer
Uint8*  gRecordingBuffer    = NULL;

//  Size of data buffer
Uint32  gBufferByteSize     = 0;

//  Position in data buffer
Uint32  gBufferBytePosition = 0;

//  Maximum position in data buffer for recording
Uint32  gBufferByteMaxPosition  = 0;

Make sure to remember to initialize audio before recording or playback. It's an easy thing to forget.

    //  Initialize SDL
    if  ( SDL_Init( SDL_INIT_VIDEO | SDL_INIT_AUDIO ) < 0 )
    {
        printf( "SDL could not initialize! SDL Error: %s\n", SDL_GetError() );
        success = false;
    }

After loading the font and rendering the initial prompt message, we get the number of available recording devices using SDL_GetNumAudioDevices. When we pass in SDL_TRUE, it gives us the number of recording devices. With SDL_FALSE, it gives us the number of playback devices.

If there isn't at least one recording device connected, we error out of the function.

bool loadMedia()
{
    //  Loading success flag
    bool success = true;

    //  Open the font
    gFont = TTF_OpenFont( "./lazy.ttf", 28 );
    if  ( gFont == NULL )
    {
        printf( "Failed to load lazy font! SDL_ttf Error: %s\n", TTF_GetError() );
        success = false;
    }
    else
    {
        //  Set starting prompt 
        gPromptTexture.loadFromRenderedText( "Select your recording device:", gTextColor );

        //  Get capture device count
        gRecordingDeviceCount = SDL_GetNumAudioDevices( SDL_TRUE );

        //  No recording devices
        if  ( gRecordingDeviceCount < 1 )
        {
            printf( "Unable to get audio capture device! SDL Error: %s\n", SDL_GetError() );
            success = false;
        }

If there are recording devices connected, we cap the number we use to 10 (which may disappoint those with 11 microphones hooked up to their PC) and then go through the devices, rendering their names to a texture. We get each device name using SDL_GetAudioDeviceName, passing in the index of the recording device and SDL_TRUE to say that we want recording device names.

        //  At least one device connected
        else
        {
            //  Cap recording device count
            if  ( gRecordingDeviceCount > MAX_RECORDING_DEVICES )
            {
                gRecordingDeviceCount = MAX_RECORDING_DEVICES;
            }

            //  Render device names
            std::stringstream promptText;
            for ( int i = 0; i < gRecordingDeviceCount; ++i )
            {
                //  Get capture device name
                promptText.str( "" );
                promptText << i << ": " << SDL_GetAudioDeviceName( i, SDL_TRUE );

                //  Set texture from name
                gDeviceTextures[ i ].loadFromRenderedText( promptText.str().c_str(), gTextColor );
            }
        }
    }

    return success;
}
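
The cap-and-label step can be tried on its own without SDL. Below is a minimal sketch that takes a list of device names as plain strings and produces the same "index: name" labels. buildDeviceLabels is a hypothetical helper, not something from the tutorial source, and it assumes the names have already been fetched.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: builds "index: name" labels for at most maxDevices
// devices, mirroring the capping done in loadMedia().
std::vector<std::string> buildDeviceLabels( const std::vector<std::string>& names, std::size_t maxDevices )
{
    // Cap the number of devices we label
    const std::size_t count = names.size() < maxDevices ? names.size() : maxDevices;

    std::vector<std::string> labels;
    labels.reserve( count );
    for ( std::size_t i = 0; i < count; ++i )
    {
        std::ostringstream label;
        label << i << ": " << names[ i ];
        labels.push_back( label.str() );
    }

    return labels;
}
```

With three names and a cap of 2, only the first two devices get labels and the third is ignored, just like the 11th microphone.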

In the main function after initializing and loading we set the initial recording state and declare two audio device IDs which are just integers to represent the recording and playback devices.

            //  Main loop flag
            bool quit = false;

            //  Event handler
            SDL_Event e;

            //  Set the default recording state
            RecordingState currentState = SELECTING_DEVICE;

            //  Audio device IDs
            SDL_AudioDeviceID   recordingDeviceId = 0;
            SDL_AudioDeviceID    playbackDeviceId = 0;

In the event handling loop we have a switch statement that handles the different states. When the user presses 0-9, we convert the key to an index, which is easy because the SDLK constants for the digits are sequential: we just subtract SDLK_0 from the key symbol.

                    //  Do current state event handling
                    switch ( currentState )
                    {
                        //  User is selecting recording device
                        case SELECTING_DEVICE:

                            //  On key press
                            if  ( e.type == SDL_KEYDOWN )
                            {
                                //  Handle key press from 0 to 9 
                                if  ( e.key.keysym.sym >= SDLK_0 && e.key.keysym.sym <= SDLK_9 )
                                {
                                    //  Get selection index
                                    int index = e.key.keysym.sym - SDLK_0;
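
The key-to-index conversion above can be sketched in isolation like this. It assumes the digit key codes are contiguous and equal to the character values '0' through '9', which is how SDL2 defines SDLK_0 through SDLK_9. The function name digitKeyToIndex is made up for this example.

```cpp
// Returns the digit index for a key code, or -1 if the key is not a digit.
// Assumption: digit key codes are contiguous and start at the value of '0',
// as SDL2's SDLK_0 through SDLK_9 do.
int digitKeyToIndex( int keyCode )
{
    const int firstDigitKey = '0';  // stands in for SDLK_0
    const int lastDigitKey  = '9';  // stands in for SDLK_9

    // Reject anything outside the digit range
    if ( keyCode < firstDigitKey || keyCode > lastDigitKey )
    {
        return -1;
    }

    // Distance from the first digit key is the index
    return keyCode - firstDigitKey;
}
```

The caller still has to check the result against the number of available devices, since pressing 7 with only two microphones connected gives a digit that is valid as a key but not as a device index.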

If the user pressed a valid index key, we then specify the recording audio spec.

First we initialize the audio spec with SDL_zero. Always initialize memory before using it. Ask the SREs that had to deal with the Heartbleed bug what happens when you don't.

We set the frequency to 44.1 kHz, which is CD quality. We're using 32-bit floating point format for the data. We have 2 channels since we want stereo. Samples are set to 4096 because that's a pretty standard size. Lastly we give it the audio recording callback.

With the spec set, we call SDL_OpenAudioDevice and pass in the recording device name, the fact that we want a recording device with SDL_TRUE, the spec we want to have, a pointer to the spec we get back from the driver, and lastly a flag that says we're ok with SDL_OpenAudioDevice giving us a different format than we requested.

                                    //  Index is valid
                                    if  ( index < gRecordingDeviceCount )
                                    {
                                        //  Default audio spec
                                        SDL_AudioSpec desiredRecordingSpec;
                                        SDL_zero(desiredRecordingSpec);
                                        desiredRecordingSpec.freq       = 44100;
                                        desiredRecordingSpec.format     = AUDIO_F32;
                                        desiredRecordingSpec.channels   = 2;
                                        desiredRecordingSpec.samples    = 4096;
                                        desiredRecordingSpec.callback   = audioRecordingCallback;

                                        //  Open recording device
                                        recordingDeviceId =
                                            SDL_OpenAudioDevice(
                                                SDL_GetAudioDeviceName( index, SDL_TRUE )   ,
                                                SDL_TRUE                                    ,
                                                  &desiredRecordingSpec                     ,
                                                &gReceivedRecordingSpec                     ,
                                                SDL_AUDIO_ALLOW_FORMAT_CHANGE
                                            );

If we get no device ID, we go to an error state. If the device opened successfully, we create a playback spec that's mostly the same as the recording spec. The major difference is that it uses the playback callback instead of the recording callback.

Opening the playback device is also mostly the same. For this tutorial, we don't care which playback device we get so we pass in NULL to grab the first available one. Secondly, we pass in SDL_FALSE for the second argument to open up a playback device instead of a recording device.

                                        //  Device failed to open
                                        if  ( recordingDeviceId == 0 )
                                        {
                                            //  Report error
                                            printf( "Failed to open recording device! SDL Error: %s\n", SDL_GetError() );
                                            gPromptTexture.loadFromRenderedText( "Failed to open recording device!", gTextColor );
                                            currentState = ERROR;
                                        }
                                        //  Device opened successfully
                                        else
                                        {
                                            //  Default audio spec
                                            SDL_AudioSpec   desiredPlaybackSpec;
                                            SDL_zero(desiredPlaybackSpec);
                                            desiredPlaybackSpec.freq        = 44100;
                                            desiredPlaybackSpec.format      = AUDIO_F32;
                                            desiredPlaybackSpec.channels    = 2;
                                            desiredPlaybackSpec.samples     = 4096;
                                            desiredPlaybackSpec.callback    = audioPlaybackCallback;

                                            //  Open playback device
                                            playbackDeviceId =
                                                SDL_OpenAudioDevice(
                                                    NULL                            ,
                                                    SDL_FALSE                       ,
                                                      &desiredPlaybackSpec          ,
                                                    &gReceivedPlaybackSpec          ,
                                                    SDL_AUDIO_ALLOW_FORMAT_CHANGE
                                                );

If we get no playback device ID, we go to an error state. If the device opened successfully, we create a byte buffer to hold the audio data we'll be recording and playing back.

To calculate how much space we need first we need to calculate the bytes per sample. If we have 2 channels and 32 bits per channel sample (which we can get using SDL_AUDIO_BITSIZE on the audio format), we'll get 2 channels * ( 32 bits / 8 bits per byte ) which is 8 bytes per sample.

To get the bytes per second, we multiply the bytes per sample by the frequency, which is the number of samples per second. 8 bytes per sample * 44 100 samples per second gets us 352 800 bytes per second.

We want to have 6 seconds of buffer (5 seconds + 1 second of padding) so we set the buffer size to be 2 116 800 bytes. That seems like a lot, but it's a little more than 2 megabytes. Remember, because of the max position, we only use 5 seconds of the 6 second buffer.

After calculating the buffer size, we allocate the buffer and initialize it with memset. Finally, we set the prompt texture and move on to the next state.
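
If you want to check the buffer arithmetic for a different format or frequency, the calculation described above can be pulled into a small function like the sketch below. BufferLayout and computeBufferLayout are hypothetical names that do not appear in the tutorial source. The parameters stand in for the values the tutorial reads from the received recording spec and from its two constants.

```cpp
#include <cstdint>

// Hypothetical struct grouping the values the tutorial keeps in separate variables
struct BufferLayout
{
    std::uint32_t bytesPerSample;   // bytes for one sample across all channels
    std::uint32_t bytesPerSecond;   // bytesPerSample * frequency
    std::uint32_t bufferByteSize;   // recording time plus padding
    std::uint32_t maxBytePosition;  // recording time only
};

BufferLayout computeBufferLayout( std::uint32_t channels, std::uint32_t bitsPerChannelSample,
                                  std::uint32_t frequency, std::uint32_t recordSeconds,
                                  std::uint32_t paddingSeconds )
{
    BufferLayout layout;

    // Channels times bytes per channel sample
    layout.bytesPerSample = channels * ( bitsPerChannelSample / 8 );

    // Samples per second times bytes per sample
    layout.bytesPerSecond = frequency * layout.bytesPerSample;

    // Whole buffer includes the padding, max position does not
    layout.bufferByteSize  = ( recordSeconds + paddingSeconds ) * layout.bytesPerSecond;
    layout.maxBytePosition = recordSeconds * layout.bytesPerSecond;

    return layout;
}
```

Calling computeBufferLayout( 2, 32, 44100, 5, 1 ) reproduces the stereo, 32-bit, 44.1 kHz case from this tutorial. If the driver hands back a different format, you would pass in the received values instead.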

                                            //  Device failed to open
                                            if  ( playbackDeviceId == 0 )
                                            {
                                                //  Report error
                                                printf( "Failed to open playback device! SDL Error: %s\n", SDL_GetError() );
                                                gPromptTexture.loadFromRenderedText( "Failed to open playback device!", gTextColor );
                                                currentState = ERROR;
                                            }
                                            //  Device opened successfully
                                            else
                                            {
                                                //  Calculate per sample bytes
                                                int bytesPerSample = gReceivedRecordingSpec.channels * ( SDL_AUDIO_BITSIZE( gReceivedRecordingSpec.format ) / 8 );

                                                //  Calculate bytes per second
                                                int bytesPerSecond = gReceivedRecordingSpec.freq * bytesPerSample;

                                                //  Calculate buffer size
                                                gBufferByteSize = RECORDING_BUFFER_SECONDS * bytesPerSecond;

                                                //  Calculate max buffer use
                                                gBufferByteMaxPosition = MAX_RECORDING_SECONDS * bytesPerSecond;

                                                //  Allocate and initialize byte buffer
                                                gRecordingBuffer = new Uint8[ gBufferByteSize ];
                                                memset( gRecordingBuffer, 0, gBufferByteSize );

                                                //  Go on to next state
                                                gPromptTexture.loadFromRenderedText("Press 1 to record for 5 seconds.", gTextColor);
                                                currentState = STOPPED;
                                            }
                                        }
                                    }
                                }
                            }
                            break;

After we've allocated the buffer, we're ready to start recording. If the user presses 1, we set the buffer position back to 0 and unpause the audio device using SDL_PauseAudioDevice. The first argument is the device we want to pause/unpause and the second argument determines whether we want to pause or unpause. Passing SDL_FALSE will unpause a device.

Audio devices are paused by default, meaning they will not record or play until you unpause them. If you're wondering why your callback isn't doing anything, this may be why.

                        //  User getting ready to record
                        case STOPPED:

                            //  On key press
                            if  ( e.type == SDL_KEYDOWN )
                            {
                                //  Start recording
                                if  ( e.key.keysym.sym == SDLK_1 )
                                {
                                    //  Go back to beginning of buffer
                                    gBufferBytePosition = 0;

                                    //  Start recording
                                    SDL_PauseAudioDevice( recordingDeviceId, SDL_FALSE );

                                    //  Go on to next state
                                    gPromptTexture.loadFromRenderedText( "Recording...", gTextColor );
                                    currentState = RECORDING;
                                }
                            }
                            break;

When the recording device is unpaused it will start calling the recording callback we gave it at regular intervals.

As you can see it doesn't do much. All it does is copy bytes from the device stream into the current position in our recording buffer and then move the position in the buffer. That is all recording is, just grabbing chunks of audio data. Just remember that "len" is the size of the chunk from the stream in bytes.

void audioRecordingCallback( void* userdata, Uint8* stream, int len )
{
    //  Copy audio from stream
    memcpy( &gRecordingBuffer[ gBufferBytePosition ], stream, len );

    //  Move along buffer
    gBufferBytePosition += len;
}
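
The callback above trusts that the padding in the buffer is always enough. As a more defensive variation (not what the tutorial source does), the copy can be clamped to the space that is left, so a late check in the main loop can never push the write past the end of the buffer. RecordingBuffer and appendChunk are hypothetical names for this sketch, and it uses a std::vector instead of the tutorial's raw pointer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for gRecordingBuffer and gBufferBytePosition
struct RecordingBuffer
{
    std::vector<std::uint8_t> bytes;
    std::size_t position = 0;
};

// Copies at most the remaining capacity and returns the bytes actually copied
std::size_t appendChunk( RecordingBuffer& buffer, const std::uint8_t* stream, int len )
{
    // Nothing to do for empty chunks or a full buffer
    if ( len <= 0 || buffer.position >= buffer.bytes.size() )
    {
        return 0;
    }

    // Clamp the chunk to the space that is left
    const std::size_t remaining = buffer.bytes.size() - buffer.position;
    const std::size_t wanted    = static_cast<std::size_t>( len );
    const std::size_t toCopy    = wanted < remaining ? wanted : remaining;

    // Copy audio from stream and move along buffer
    std::memcpy( buffer.bytes.data() + buffer.position, stream, toCopy );
    buffer.position += toCopy;

    return toCopy;
}
```

In an SDL callback this would be called with the stream and len arguments the callback receives. Any audio that doesn't fit is simply dropped.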

Here we're jumping to the update part of the main loop. When we're recording, we need to check whether we've filled the 5 seconds of buffer. Before we can check the buffer position we have to call SDL_LockAudioDevice. The thing is, the callback is being run in another thread and we don't want to have two threads accessing the same variable at the same time. SDL_LockAudioDevice stops the callback from being called while we need to access the buffer position which the callback also manipulates.

Once the recording device is locked, we check if the buffer position is past 5 seconds of data. If it is, we pause the recording device to halt recording and we move on to the next state. Lastly, we call SDL_UnlockAudioDevice so if there's still data to record the recording device can continue.

This is a very simple example of multithreading. If you would like to know more on the subject you can check tutorials on multithreading, semaphores, and mutexes.

                //  Updating recording
                if  ( currentState == RECORDING )
                {
                    //  Lock callback
                    SDL_LockAudioDevice( recordingDeviceId );

                    //  Finished recording
                    if  ( gBufferBytePosition > gBufferByteMaxPosition )
                    {
                        //  Stop recording audio
                        SDL_PauseAudioDevice( recordingDeviceId, SDL_TRUE );

                        //  Go on to next state
                        gPromptTexture.loadFromRenderedText( "Press 1 to play back. Press 2 to record again.", gTextColor );
                        currentState = RECORDED;
                    }

                    //  Unlock callback
                    SDL_UnlockAudioDevice( recordingDeviceId );
                }

Here we're jumping back to event handling for after we recorded 5 seconds.

As you can see, it's similar to when we started recording. We set the buffer position back to the beginning, unpause the playback device, and set the next state.

                        //  User has finished recording
                        case RECORDED:

                            //  On key press
                            if  ( e.type == SDL_KEYDOWN )
                            {
                                //  Start playback
                                if  ( e.key.keysym.sym == SDLK_1 )
                                {
                                    //  Go back to beginning of buffer
                                    gBufferBytePosition = 0;

                                    //  Start playback
                                    SDL_PauseAudioDevice( playbackDeviceId, SDL_FALSE );

                                    //  Go on to next state
                                    gPromptTexture.loadFromRenderedText( "Playing...", gTextColor );
                                    currentState = PLAYBACK;
                                }

The playback callback is also similar to the recording callback. The key difference here is that instead of copying from the device to the buffer, we're taking the data we recorded in the buffer and copying it back to the device.

void audioPlaybackCallback( void* userdata, Uint8* stream, int len )
{
    //  Copy audio to stream
    memcpy( stream, &gRecordingBuffer[ gBufferBytePosition ], len );

    //  Move along buffer
    gBufferBytePosition += len;
}
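
A playback callback is expected to fill the whole stream it is handed. Here is a minimal sketch of a variation that copies whatever recorded data is left and fills the rest with zero bytes, which is silence for the 32-bit float format used in this tutorial. fillPlayback is a hypothetical helper, not the tutorial's callback, and it takes the buffer and position as arguments instead of using globals.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies recorded bytes into stream and pads the remainder with zeros.
// Returns the number of recorded bytes that were copied.
std::size_t fillPlayback( const std::vector<std::uint8_t>& recorded, std::size_t& position,
                          std::uint8_t* stream, int len )
{
    if ( len <= 0 )
    {
        return 0;
    }

    // How much the device wants versus how much we have left
    const std::size_t wanted    = static_cast<std::size_t>( len );
    const std::size_t available = position < recorded.size() ? recorded.size() - position : 0;
    const std::size_t toCopy    = wanted < available ? wanted : available;

    // Copy audio to stream
    if ( toCopy > 0 )
    {
        std::memcpy( stream, recorded.data() + position, toCopy );
    }

    // Fill whatever is left of the stream with silence
    std::memset( stream + toCopy, 0, wanted - toCopy );

    // Move along buffer
    position += toCopy;

    return toCopy;
}
```

Once the position reaches the end of the recorded data, every further call just outputs silence until the device is paused.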

Updating playback is again similar to recording. We lock the playback device, check the playback position, stop playback if the buffer position is past the end point, and unlock the playback device.

                //  Updating playback
                else if ( currentState == PLAYBACK )
                {
                    //  Lock callback
                    SDL_LockAudioDevice( playbackDeviceId );

                    //  Finished playback
                    if  ( gBufferBytePosition > gBufferByteMaxPosition )
                    {
                        //  Stop playing audio
                        SDL_PauseAudioDevice( playbackDeviceId, SDL_TRUE );

                        //  Go on to next state
                        gPromptTexture.loadFromRenderedText( "Press 1 to play back. Press 2 to record again.", gTextColor );
                        currentState = RECORDED;
                    }

                    //  Unlock callback
                    SDL_UnlockAudioDevice( playbackDeviceId );
                }

Jumping back to event handling for after the user has recorded, we allow for the user to rerecord. When we want to record again, we just jump the buffer position back to the beginning, initialize the buffer, and unpause the recording device.

                                //  Record again
                                if  ( e.key.keysym.sym == SDLK_2 )
                                {
                                    //  Reset the buffer
                                    gBufferBytePosition = 0;
                                    memset( gRecordingBuffer, 0, gBufferByteSize );

                                    //  Start recording
                                    SDL_PauseAudioDevice( recordingDeviceId, SDL_FALSE );

                                    //  Go on to next state
                                    gPromptTexture.loadFromRenderedText( "Recording...", gTextColor );
                                    currentState = RECORDING;
                                }
                            }
                            break;
                    }

As always, don't forget to deallocate the buffer after using it to prevent memory leaks.

Now, audio programming is a very large field which I am by no means an expert in. However, now that you know how to handle raw audio data, you can look into using audio libraries and frameworks to do more complicated things like audio compression and voice chat.

void close()
{
    //  Free textures
    gPromptTexture.free();
    for ( int i = 0; i < MAX_RECORDING_DEVICES; ++i )
    {
        gDeviceTextures[ i ].free();
    }

    //  Free global font
    TTF_CloseFont( gFont );
    gFont = NULL;

    //  Destroy window    
    SDL_DestroyRenderer( gRenderer );
    SDL_DestroyWindow( gWindow );
    gWindow     = NULL;
    gRenderer   = NULL;

    //  Free recording buffer
    if  ( gRecordingBuffer != NULL )
    {
        delete[] gRecordingBuffer;
        gRecordingBuffer = NULL;
    }

    //  Quit SDL subsystems
    TTF_Quit();
    IMG_Quit();
    SDL_Quit();
}
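
If you would rather not manage the buffer by hand, one alternative (not used in the tutorial source) is to let a std::vector own the bytes. The sketch below assumes a hypothetical makeRecordingBuffer helper. The vector zeroes its contents on creation and frees them on destruction, so the separate memset and delete[] calls are not needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: creates a zero-filled byte buffer that cleans up after itself
std::vector<std::uint8_t> makeRecordingBuffer( std::size_t byteSize )
{
    // Value-initializes every byte to 0, replacing the separate memset
    return std::vector<std::uint8_t>( byteSize );
}
```

The callbacks could then reach the raw bytes through the vector's data() function, as long as the vector outlives the open audio devices.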
