Friday, 31 January 2014

Autotune.NET

Autotune.NET

We've all cringed as a hopelessly out of tune contestant appears on the latest episode of “American Idol.” Occasionally, there's a contestant who manages to be pitch perfect all the way through—right until they flub the final note. And in the cutthroat world of televised auditions, sing one slightly flat note and you're out.
So what takes care of a bad-pitch day? Autotune—an effect that corrects the pitch of your voice so you'll never again sing out of tune. And now, with the power of modern microprocessors, autotune is possible in real-time, allowing singers to benefit from its almost magical powers during live concerts.
The company most famous for its autotune effect is Antares. Antares Auto-Tune currently retails for $249, and a stripped down version is available for $100. In addition to simply improving the pitch of a dodgy singer, autotune can be used to create unique robotic sounding vocal effects, a technique massively popular in recent years thanks to its use by artists such as T-Pain and the group behind the “Auto-Tune the News” YouTube videos. In 1998, when the effect was first used on Cher's “Believe” single, the producer used such extreme settings that instead of subtly adjusting the pitch, autotune “snapped” instantaneously to the nearest “correct” note.
Here is a nerdy example of what Autotune can do.

How does Autotune work?

An autotune effect has two parts. The first is pitch detection, which calculates the dominant frequency of the incoming signal, and is the reason autotune is normally used on monophonic audio sources (i.e. playing one note at a time, not whole chords). So, if your guitar is out of tune, you're out of luck (Celemony's Melodyne product, however, features some incredible capabilities for pitch-shifting polyphonic audio).
The second stage is pitch shifting, or “correcting” a given note. However, the bigger the pitch shift required, the more artificial the end result will be, and it is worth noting that absolutely perfect pitch is not always desirable. Sometimes the blended notes resulting from vibrato, for example, are an important part of the performance, and eliminating them would be detrimental.

Creating a .NET Autotune Algorithm

For this project, we will be creating an autotune effect for .NET. Serious recording enthusiasts should just go out and buy a decent autotune effect, but to have a little fun, we'll see if we can make an autotune that can give us a poor man's Cher effect (or T-Pain effect, if you prefer).
To get started, I searched to see if there were some pre-existing open source autotune implementations, which brought me to awesomebox, a project created by Ravi Parikh and Keegan Poppen while they were students at Stanford University. They kindly gave me permission to make use of their code, which uses an auto-correlator for pitch detection and an open source pitch-shifting algorithm from audio DSP expert Stephan M. Bernsee.

Porting C++ to C#

Although porting from C/C++ to C# is not exactly fun, there is enough similarity in the syntax that it is possible to complete without too many changes. You need to remember, however, that a long in C is an int in C# (i.e. 32 bits long not 64).
Additionally, the C# compiler is fussier than C/C++ when it comes to casting between floats, doubles, and ints. Putting the “f” suffix on numeric literals sorts out most of these compiler errors.
Pointers can be a pain. I tend to replace them with integer variables used to index into an array. You can of course use unsafe code, but that limits your options if you plan to port to Silverlight or Windows Phone 7 at a later date, neither of which allow unsafe code or interop into unmanaged code. The necessary mathematical functions are available in the System.Math class.
To see an example, compare this pitch shifting C++ source file with my C# conversion of it.

Capturing Audio with NAudio

Interop wrappers for the NAudio Windows WaveIn APIs capture the audio. Here is the code used to start recording:
c#:
waveIn = new WaveIn();
waveIn.DeviceNumber = recordingDevice;
waveIn.DataAvailable += waveIn_DataAvailable;
waveIn.RecordingStopped += new EventHandler(waveIn_RecordingStopped);
waveIn.WaveFormat = recordingFormat;
waveIn.StartRecording();
VB.Net:
waveIn = New WaveIn
waveIn.DeviceNumber = recordingDevice
AddHandler waveIn.DataAvailable, AddressOf waveIn_DataAvailable
AddHandler waveIn.RecordingStopped, AddressOf waveIn_RecordingStopped
waveIn.WaveFormat = _recordingFormat
waveIn.StartRecording()
The steps are:
  1. Create a new WaveIn device
  2. [Optional] set up the device number (0 for default recording device)
  3. Add a handler to the DataAvailable event—this is where we will receive the raw audio data
  4. Add a handler for the RecordingStopped event. This allows us to close the temporary WAV file we created
  5. Set up the recording format. For this project we are going to record in mono (i.e. one channel), 16 bit, 44.1kHz audio—the default setting for most microphones
  6. Call the StartRecording method
Whenever the soundcard reports a new buffer of recorded audio, we receive it in theDataAvailable event handler:
c#:
void waveIn_DataAvailable(object sender, WaveInEventArgs e)
{
    byte[] buffer = e.Buffer;
    int bytesRecorded = e.BytesRecorded;
    WriteToFile(buffer, bytesRecorded);

    for (int index = 0; index < e.BytesRecorded; index += 2)
    {
        short sample = (short)((buffer[index + 1] << 8) |
                                buffer[index + 0]);
        float sample32 = sample / 32768f;
        sampleAggregator.Add(sample32);
    }
}
VB.Net
Private Sub waveIn_DataAvailable(ByVal sender As Object, ByVal e As WaveInEventArgs)
    Dim buffer() = e.Buffer
    Dim bytesRecorded = e.BytesRecorded
    WriteToFile(buffer, bytesRecorded)

    For index = 0 To e.BytesRecorded - 1 Step 2
        Dim sample = CShort(buffer(index + 1)) << 8 Or CShort(buffer(index + 0))
        Dim sample32 = sample / 32768.0F
        _sampleAggregator.Add(sample32)
    Next index
End Sub
The WaveInEventArgs contains the number of bytes recorded (e.BytesRecorded) and a pointer to the buffer containing those bytes (e.Buffer). The handler does two things with the recorded data. First, it calls WriteToFile, which uses the WaveFileWriter class from NAudio to write the data to disk:
c#:
// before we start recording, set up a WaveFileWriter...
writer = new WaveFileWriter(waveFileName, recordingFormat);

// ... every block we receive we write it to the WaveFileWriter:
writer.WriteData(buffer, 0, bytesRecorded);

// ... and when recording stops we must call Dispose to finalize the
// .WAV file properly
writer.Dispose()
VB.Net:
writer = New WaveFileWriter(waveFileName, _recordingFormat)
writer.WriteData(buffer, 0, bytesRecorded)
writer.Dispose()

Converting Audio to Floating Point and Back Again

Once recording has completed, we have a WAV file on which to perform our autotune effect. However, our WAV file consists of 16 bit samples (i.e. System.Int16 aka short). In other words, we have a sequence of byte pairs, each of which represent a number in the range -32768 to 32767. For the digital signal processing we will be performing, it's best to have a sequence of floating point numbers (System.Single or float) in the range -1.0f to 1.0f. This is a common requirement, so NAudio provides a utility class to convert audio from short to float called Wave16ToFloatProvider. Here's the code that takes a WAV file and implements the autotune algorithm on it:
c#:
public static void ApplyAutoTune(string fileToProcess, string tempFile, AutoTuneSettings autotuneSettings)
{
    using (WaveFileReader reader = new WaveFileReader(fileToProcess))
    {
        IWaveProvider stream32 = new Wave16toFloatProvider(reader);
        IWaveProvider streamEffect = new AutoTuneWaveProvider(stream32, autotuneSettings);
        IWaveProvider stream16 = new WaveFloatTo16Provider(streamEffect);
        using (WaveFileWriter converted = new WaveFileWriter(tempFile, stream16.WaveFormat))
        {
            // buffer length needs to be a power of 2 for FFT to work nicely
            // however, make the buffer too long and pitches aren't detected fast enough
            // successful buffer sizes: 8192, 4096, 2048, 1024
            // (some pitch detection algorithms need at least 2048)
            byte[] buffer = new byte[8192]; 
            int bytesRead;
            do
            {
                bytesRead = stream16.Read(buffer, 0, buffer.Length);
                converted.WriteData(buffer, 0, bytesRead);
            } while (bytesRead != 0 && converted.Length < reader.Length);
        }
    }
}
VB.Net
Public Shared Sub ApplyAutoTune(ByVal fileToProcess As String,
    ByVal tempFile As String,
    ByVal autotuneSettings As AutoTuneSettings)
    Using reader As New WaveFileReader(fileToProcess)
        Dim stream32 As IWaveProvider = New Wave16ToFloatProvider(reader)
        Dim streamEffect As IWaveProvider = New AutoTuneWaveProvider(stream32, autotuneSettings)
        Dim stream16 As IWaveProvider = New WaveFloatTo16Provider(streamEffect)
        Using converted As New WaveFileWriter(tempFile, stream16.WaveFormat)
            ' buffer length needs to be a power of 2 for FFT to work nicely
            ' however, make the buffer too long and pitches aren't detected fast enough
            ' successful buffer sizes: 8192, 4096, 2048, 1024
            ' (some pitch detection algorithms need at least 2048)
            Dim buffer(8191) As Byte
            Dim bytesRead As Integer
            Do
                bytesRead = stream16.Read(buffer, 0, buffer.Length)
                converted.WriteData(buffer, 0, bytesRead)
            Loop While bytesRead <> 0 AndAlso converted.Length < reader.Length
        End Using
    End Using
End Sub
Here's how it works:
  1. First we use a WaveFileReader to open the file that we've just created, which contains 16 bit audio samples
  2. Then we use the Wave16ToFloatProvider to perform the conversion to floating point samples
  3. Next we pass it through our autotune effect (the AutotuneWaveProvider). We'll explain how this works this later
  4. Then we use the WaveFloatTo16Provider to convert back to 16 bit samples ready for saving to WAV (we could save a 32 bit WAV, but it would be rather wasteful of disk space)
  5. Having set up the audio pipeline, we can read from the WaveFloatTo16Provider and pull audio right through from the WAV file. We need to read in block sizes that are a power of 2, since we are passing the data through FFTs. If we want to read arbitrary block sizes, we need to introduce another element into our pipeline to buffer up enough data to pass through FFTs
  6. Finally, we use the WaveFileWriter to write the data we have read into a WAV file

The AutoTuneWaveProvider

As we saw in the last code snippet, the AutoTuneWaveProvider is the piece in our audio pipeline that actually performs the autotune effect. It implements the NAudioIWaveProvider interface, which allows it to be used in the pipeline for real-time playback if necessary, even though our example code is not doing this (see the section on performance later). Here's the AutoTuneWaveProvider constructor:
c#:
public AutoTuneWaveProvider(IWaveProvider source, AutoTuneSettings autoTuneSettings)
{
    this.autoTuneSettings = autoTuneSettings;
    if (source.WaveFormat.SampleRate != 44100)
        throw new ArgumentException("AutoTune only works at 44.1kHz");
    if (source.WaveFormat.Encoding != WaveFormatEncoding.IeeeFloat)
        throw new ArgumentException("AutoTune only works on IEEE floating point audio data");
    if (source.WaveFormat.Channels != 1)
        throw new ArgumentException("AutoTune only works on mono input sources");

    this.source = source;
    this.pitchDetector = new AutoCorrelator(source.WaveFormat.SampleRate);
    this.pitchShifter = new SmbPitchShifter(Settings);
    this.waveBuffer = new WaveBuffer(8192);
}
VB.Net
Public Sub New(ByVal source As IWaveProvider, ByVal autoTuneSettings As AutoTuneSettings)
    Me.autoTuneSettings = autoTuneSettings
    If source.WaveFormat.SampleRate <> 44100 Then
        Throw New ArgumentException("AutoTune only works at 44.1kHz")
    End If
    If source.WaveFormat.Encoding <> WaveFormatEncoding.IeeeFloat Then
        Throw New ArgumentException("AutoTune only works on IEEE floating point audio data")
    End If
    If source.WaveFormat.Channels <> 1 Then
        Throw New ArgumentException("AutoTune only works on mono input sources")
    End If

    Me.source = source
    Me.pitchDetector = New AutoCorrelator(source.WaveFormat.SampleRate)
    ' alternative pitch detector:
    ' Me.pitchDetector = New FftPitchDetector(source.WaveFormat.SampleRate)
    Me.pitchShifter = New SmbPitchShifter(Settings, source.WaveFormat.SampleRate)
    Me.waveBuffer = New WaveBuffer(8192)
End Sub
Some points to notice:
  1. We pass in a source IWaveProvider—this is where the data will be coming from
  2. We check that the source is in the right format—floating point mono input.
  3. We also pass in an AutoTuneSettings object. This not only encapsulates the settings for autotune, it is important if you want to adjust the settings in real-time while the effect is running
  4. We then create the two key components of our autotune effect: a pitch detector (which uses an autocorrelator), and a pitch shifter
  5. Finally we create a buffer to use for audio processing. This can be a byte[] array, but we use WaveBuffer from NAudio because it uses a clever trick that allows us to cast a byte[] into a float[] without using unsafe code or having to copy all of the data
The key method on any implementation of IWaveProvider is its Read method. This is where the audio consumer, usually the sound card or a WaveFileWriter, asks for data. The data must be supplied as a byte array, and if at all possible you should return exactly the number of bytes you were asked for (if you can't, an extra layer of buffering is usually required, or audio playback will be choppy). Here's our implementation of the Read method:
c#:
public int Read(byte[] buffer, int offset, int count)
{
    if (waveBuffer == null || waveBuffer.MaxSize < count)
    {
        waveBuffer = new WaveBuffer(count);
    }

    int bytesRead = source.Read(waveBuffer, 0, count);

    // the last bit sometimes needs to be rounded up:
    if (bytesRead > 0) bytesRead = count;

    int frames = bytesRead / sizeof(float); 
    float pitch = pitchDetector.DetectPitch(waveBuffer.FloatBuffer, frames);
        
    // an attempt to make it less "warbly" by holding onto the pitch 
    // for at least one more buffer
    if (pitch == 0f && release < maxHold)
    {
        pitch = previousPitch;
        release++;
    }
    else
    {
        this.previousPitch = pitch;
        release = 0;
    }
    
    WaveBuffer outBuffer = new WaveBuffer(buffer);

    pitchShifter.ShiftPitch(waveBuffer.FloatBuffer, pitch, 0.0f, outBuffer.FloatBuffer, frames);

    return frames * 4;
}
VB.Net:
Public Function Read(ByVal buffer() As Byte, ByVal offset As Integer,
                        ByVal count As Integer) As Integer Implements NAudio.Wave.IWaveProvider.Read
    If waveBuffer Is Nothing OrElse waveBuffer.MaxSize < count Then
        waveBuffer = New WaveBuffer(count)
    End If

    Dim bytesRead = source.Read(waveBuffer, 0, count)
    'Debug.Assert(bytesRead = count)

    ' the last bit sometimes needs to be rounded up:
    If bytesRead > 0 Then
        bytesRead = count
    End If

    'pitchsource->getPitches()
    Dim frames = bytesRead \ Len(New Single) ' MRH: was count
    Dim pitch = pitchDetector.DetectPitch(waveBuffer.FloatBuffer, frames)

    ' MRH: an attempt to make it less "warbly" by holding onto the pitch for at least one more buffer
    If pitch = 0.0F AndAlso release < maxHold Then
        pitch = previousPitch
        release += 1
    Else
        Me.previousPitch = pitch
        release = 0
    End If

    Dim midiNoteNumber = 40
    Dim targetPitch = CSng(8.175 * Math.Pow(1.05946309, midiNoteNumber))

    Dim outBuffer As New WaveBuffer(buffer)

    pitchShifter.ShiftPitch(waveBuffer.FloatBuffer, pitch, targetPitch, outBuffer.FloatBuffer, frames)

    Return frames * 4
End Function
Here's what's going on
  1. First we need to read from our source (in our case, a WAV file converted to floating point samples)
  2. If we get less data than we were expecting, we know that means we're at the end of the file, so we'll just pretend we got a full buffer
  3. We then work out how many audio ‘frames' are present (which is the same as the number of samples since this is mono audio). It's floating point audio, so frames equal bytes divided by four
  4. We then pass the data through our pitch detector algorithm (see below)
  5. Next we use some experimental code to stabilize pitch detection by reporting the previous frequency when no pitch is picked up
  6. Finally we pass the data into our pitch shifter, including details of the detected pitch
Now that we've seen the big picture of the AutotuneWaveProvider, let's drill down into its two main components—the pitch detector and pitch shifter.

Pitch Detection with Autocorrelation

The pitch detection part of autotune is vital to getting good results. If it can't accurately detect the input pitch, it will incorrectly calculate how much the pitch needs to be adjusted. However, high quality pitch detection is quite difficult to get right. First of all, the microphone may well pick up background noise. Second, when you sing a into a microphone, the signal consists not only of a single frequency, but also “harmonics” at different frequencies.
The good news is that we need to detect only the primary pitch.
The awesomebox algorithm makes use of “autocorrelation” for its pitch detection, but I made a few small tweaks to how the algorithm is implemented in an attempt to improve its accuracy. Autocorrelation has the advantage of being a relatively quick process. The basic principle is that if a signal is periodic, it will “correlate” well with itself when shifted forward (or backwards) one cycle.
Let's say we are looking to see if the note “Middle C” is being sung. The frequency of Middle C is around 262Hz. If we are sampling at 44.1kHz (which is standard for CD quality audio), then we will expect the signal to repeat at approximately every 168 samples (44100/262). Accordingly, for every sample in the buffer, we calculate the sum of squares of that sample and the sample 168 samples previous. We do this for every possible offset that measures a frequency in the range we want to detect (I am using 85Hz to 300Hz, which is adequate for pitch detecting vocals). The offset with the highest score is the most likely frequency.
Let's have a look at the code for an autocorrelation algorithm, starting with the constructor for the AutoCorrelator class:
c#:
public AutoCorrelator(int sampleRate)
{
    this.sampleRate = (float)sampleRate;
    int minFreq = 85;
    int maxFreq = 255;

    this.maxOffset = sampleRate / minFreq;
    this.minOffset = sampleRate / maxFreq;
}
VB.Net
Public Sub New(ByVal sampleRate As Integer)
    Me.sampleRate = CSng(sampleRate)
    Dim minFreq = 85
    Dim maxFreq = 255

    Me.maxOffset = sampleRate \ minFreq
    Me.minOffset = sampleRate \ maxFreq
End Sub
First of all, we pre-calculate some values based on the minimum and maximum frequencies we are looking for. Remember that lower frequencies are harder to detect than higher frequencies, so don't set minFreq too low. MaxOffset and MinOffset are the maximum and minimum backwards distances we will be seeking while looking for a match.
c#:
public float DetectPitch(float[] buffer, int frames)
{
    if (prevBuffer == null)
    {
        prevBuffer = new float[frames];
    }

    float maxCorr = 0;
    int maxLag = 0;

    // starting with low frequencies, working to higher
    for (int lag = maxOffset; lag >= minOffset; lag--)
    {
        float corr = 0; //  sum of squares
        for (int i = 0; i < frames; i++)
        {
            int oldIndex = i - lag;
            float sample = ((oldIndex < 0) ? prevBuffer[frames + 
            corr += (sample * buffer[i]);
        }
        if (corr > maxCorr)
        {
            maxCorr = corr;
            maxLag = lag;
        }

    }
    for (int n = 0; n < frames; n++)
    { 
        prevBuffer[n] = buffer[n]; 
    }
    float noiseThreshold = frames / 1000f;

    if (maxCorr < noiseThreshold || maxLag == 0) return 0.0f;
    return this.sampleRate / maxLag;
}
VB.Net
Public Function DetectPitch(ByVal buffer() As Single, ByVal frames As Integer) As Single Implements IPitchDetector.DetectPitch
    If prevBuffer Is Nothing Then
        prevBuffer = New Single(frames - 1){}
    End If
    Dim secCor As Single = 0
    Dim secLag = 0

    Dim maxCorr As Single = 0
    Dim maxLag = 0

    ' starting with low frequencies, working to higher
    For lag = maxOffset To minOffset Step -1
        Dim corr As Single = 0 ' this is calculated as the sum of squares
        For i = 0 To frames - 1
            Dim oldIndex = i - lag
            Dim sample = (If(oldIndex < 0, prevBuffer(frames + oldIndex), buffer(oldIndex)))
            corr += (sample * buffer(i))
        Next i
        If corr > maxCorr Then
            maxCorr = corr
            maxLag = lag
        End If
        If corr >= 0.9 * maxCorr Then
            secCor = corr
            secLag = lag
        End If
    Next lag
    For n = 0 To frames - 1
        prevBuffer(n) = buffer(n)
    Next n
    Dim noiseThreshold = frames / 1000.0F
    'Debug.WriteLine(String.Format("Max Corr: {0} ({1}), Sec Corr: {2} ({3})", Me.sampleRate / maxLag, maxCorr, Me.sampleRate / secLag, secCor))
    If maxCorr < noiseThreshold OrElse maxLag = 0 Then
        Return 0.0F
    End If
    'Return 44100.0f / secLag '--works better for singing
    Return Me.sampleRate / maxLag
End Function
A few things to notice:
  1. Notice that the audio comes in as an array of floating point numbers. NAudio performs this conversion from 16 bit audio for us by usingWave16ToFloatProvider
  2. We store the previous buffer. This allows us to look backwards for correlation
  3. We then work through each and every possible integer offset within our range, and calculate a correlation value
  4. The correlation is calculated as the sum of squares
  5. If it is the largest so far, we store the “lag” value (i.e. number of samples back that we correlated with)
  6. Notice that we return 0 (i.e. no frequency detected) if we don't find a strong frequency. This noise threshold may need to be tweaked depending on your input
  7. Finally, we convert into a frequency with the formula sampleRate / maxLag
I wrote some unit tests to measure the accuracy of detection with sine waves (which admittedly are the easiest to detect). Here are the results for audio sampled at 44.1kHz:
Test Frequency
Detected Pitch
109.99Hz
108.35Hz
116.53Hz
118.23Hz
123.46Hz
123.18Hz
130.80Hz
129.71Hz
138.58Hz
140.00Hz
146.82Hz
148.48Hz
155.55Hz
154.74Hz
164.80Hz
163.33Hz
174.60Hz
172.94Hz
184.98Hz
183.75Hz
195.98Hz
194.27Hz
207.63Hz
206.07Hz
219.98Hz
219.40Hz
233.06Hz
234.57Hz
246.92Hz
247.75Hz
261.60Hz
256.40Hz
277.16Hz
139.56Hz
293.64Hz
146.03Hz
Notice that the detected frequencies from the final two tests are actually half the correct amount. This doesn't actually matter for our purposes, since this just means the frequency has been detected as one octave below the correct note.
To improve on the accuracy of the autocorrelator's results, there are a couple of things you can do:
  • Run a band-pass filter before-hand, to remove any frequencies outside of the desired range
  • Combine the results with those obtained from a different technique, such as counting zero-crossings

Pitch Detection with the Fast Fourier Transform

I decided to implement an alternative pitch detection algorithm to see if I could get better results. A different approach is to use the Fast Fourier Transform, which converts signals from the “time domain” into the “frequency domain.”
The basic approach is to take a block of samples (which must be a power of 2 – e.g. 1024), and run the FFT on them. The FFT takes complex numbers as inputs, which for audio signals are entirely real. The implementation I am using expects real and complex parts interleaved for the input buffer. Here's our code setting up fftBuffer with interleaved samples:
c#:
private float[] fftBuffer;
private float[] prevBuffer;

public float DetectPitch(float[] buffer, int inFrames)
{
    Func<int, int, float> window = HammingWindow;
    if (prevBuffer == null)
    {
        prevBuffer = new float[inFrames];
    }
 
    // double frames since we are combining present and previous buffers
    int frames = inFrames * 2;
    if (fftBuffer == null)
    {
        fftBuffer = new float[frames * 2]; // times 2 because it is complex input
    }
 
    for (int n = 0; n < frames; n++)
    {
        if (n < inFrames)
        {
            fftBuffer[n * 2] = prevBuffer[n] * window(n, frames);
            fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer
        }
        else
        {
            fftBuffer[n * 2] = buffer[n-inFrames] * window(n, frames);
            fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer
        }
    }
VB.Net
Private fftBuffer() As Single
Private prevBuffer() As Single

Public Function DetectPitch(ByVal buffer() As Single,
                            ByVal inFrames As Integer) As Single Implements IPitchDetector.DetectPitch
    Dim window As Func(Of Integer, Integer, Single) = AddressOf HammingWindow
    If prevBuffer Is Nothing Then
        prevBuffer = New Single(inFrames - 1) {}
    End If

    ' double frames since we are combining present and previous buffers
    Dim frames = inFrames * 2
    If fftBuffer Is Nothing Then
        fftBuffer = New Single(frames * 2 - 1) {} ' times 2 because it is complex input
    End If

    For n = 0 To frames - 1
        If n < inFrames Then
            fftBuffer(n * 2) = prevBuffer(n) * window(n, frames)
            fftBuffer(n * 2 + 1) = 0 ' need to clear out as fft modifies buffer
        Else
            fftBuffer(n * 2) = buffer(n - inFrames) * window(n, frames)
            fftBuffer(n * 2 + 1) = 0 ' need to clear out as fft modifies buffer
        End If
    Next n
Notice that we prepend the previous buffer we were passed. This is a common way of increasing the accuracy and resolution of an FFT by using overlapping windows, and can be further extended to store three previous buffers, allowing us to have 75% overlapping windows instead of just the 50% that we have in this example.
For better peak frequency detection, the signal that is passed into the FFT is best pre-processed with a “windowing” function. There are several to choose from, each with its own strengths and weaknesses. I used the Hamming window, which is a fairly common choice:
c#:
private float HammingWindow(int n, int N) 
{
    return 0.54f - 0.46f * (float)Math.Cos((2 * Math.PI * n) / (N - 1));
}
VB.Net
Private Function HammingWindow(ByVal n As Integer, ByVal _N As Integer) As Single
    Return 0.54F - 0.46F * CSng(Math.Cos((2 * Math.PI * n) / (_N - 1)))
End Function
The next step is to pass on our interleaved buffer to the FFT algorithm. I am using Stephan Bernsee's here, though there is an alternative implementation in NAudio that I could have used. Since the same function can be used for an inverse FFT, the -1 parameter means (rather counter-intuitively), do a forwards FFT. It processes the data in place, which is fine since we don't need to keep the contents of the input buffer:
c#:
// assuming frames is a power of 2
SmbPitchShift.smbFft(fftBuffer, frames, -1);
VB.Net
' assuming frames is a power of 2
SmbPitchShift.smbFft(fftBuffer, frames, -1)
Once we have completed the FFT, we are ready to interpret its output. The output of the FFT consists of complex numbers (again real followed by imaginary in our buffer), which represent frequency “bins.”
We start off by calculating the bin size and working out which bins correspond to the range of frequencies we are interested in detecting:
c#:
float binSize = sampleRate / frames;
int minBin = (int)(85 / binSize);
int maxBin = (int)(300 / binSize);
VB.Net
Dim binSize = sampleRate / frames
Dim minBin = CInt(Fix(85 / binSize))
Dim maxBin = CInt(Fix(300 / binSize))
For example, if our sample rate is 44.1kHz and we analyse a block of 1024 samples, then each bin represents 43Hz, which is hardly the granularity we are looking for. To increase resolution, our options are to either sample at a higher rate or analyse a bigger chunk. Our approach is to use overlapping blocks of 8192 samples, as we read 4096 samples each time. This means we have a resolution of around 5Hz, which is much more acceptable.
Now we can calculate the magnitude or “intensity” for each frequency by calculating the sum of squares (strictly we should then take the square root, but we don't need to since we are just looking for the largest value):
c#:
float maxIntensity = 0f;
int maxBinIndex = 0;

for (int bin = minBin; bin <= maxBin; bin++)
{
    float real = fftBuffer[bin * 2];
    float imaginary = fftBuffer[bin * 2 + 1];
    float intensity = real * real + imaginary * imaginary;
    if (intensity > maxIntensity)
    {
        maxIntensity = intensity;
        maxBinIndex = bin;
    }
}
VB.Net
Dim maxIntensity = 0.0F
Dim maxBinIndex = 0
For bin = minBin To maxBin
    Dim real = fftBuffer(bin * 2)
    Dim imaginary = fftBuffer(bin * 2 + 1)
    Dim intensity = real * real + imaginary * imaginary
    If intensity > maxIntensity Then
        maxIntensity = intensity
        maxBinIndex = bin
    End If
Next bin
Since we have identified the bin with the maximum intensity, we can calculate the detected frequency:
c#:
return binSize * maxBinIndex;
VB.Net
Return binSize * maxBinIndex
I don't currently specify a minimum threshold for maxIntensity, but perhaps if it were very low, the FFT pitch detector would return zero to indicate no pitch detected instead of returning an answer that is probably not accurate.
Let's have a look at how the FFT pitch detector does:
Test Frequency
Detected Pitch
109.99Hz
107.67Hz
116.53Hz
118.43Hz
123.46Hz
123.82Hz
130.80Hz
129.20Hz
138.58Hz
139.97Hz
146.82Hz
145.35Hz
155.55Hz
156.12Hz
164.80Hz
166.88Hz
174.60Hz
172.27Hz
184.98Hz
183.03Hz
195.98Hz
193.80Hz
207.63Hz
209.95Hz
219.98Hz
220.72Hz
233.06Hz
231.48Hz
246.92Hz
247.63Hz
261.60Hz
263.78Hz
277.16Hz
274.55Hz
293.64Hz
296.08Hz
As can be seen, it correctly picks out the primary frequencies of the higher notes, but overall it doesn't get that much closer than the autocorrelator, so I've left that as the default algorithm. You, however, can swap in the FFT detector in the code if it works better for the material you are auto-tuning.
There are ways of using the phase information from the FFT output to increase the accuracy of pitch detection even further, but I have left that as an exercise for the reader!

Pitch Shifting

The next step is to determine how much we will shift the pitch. The simplest way to do this is to look for the musical pitch that is closest to the detected pitch. Then, the amount of the shift by is simply the ratio of those two notes.
There are, however, some additional considerations. First, we may want to select a subset of musical notes that are acceptable. For example, only notes in the key of C#, or maybe F# minor pentatonic. This may require a slightly more radical adjustment.
Second, depending on the effect we are after, we may not want to instantaneously jump to the new frequency. The code I am using utilizes a fairly rudimentary “attack” time parameter, allowing you to gradually move to the new frequency.
The actual DSP for the pitch-shifting effect is more or less untouched from Stephan Bernsee's code, and this is because it works really well. The Bernsee's code makes use of the Fast Fourier Transform, plus a bunch of clever mathematics, which I almostunderstand, but not quite well enough to try and explain here! You're better off reading an article in which the man himself explains how it works.
The class that manages the pitch-shifting algorithm is called SmbPitchShifer and inherits from a PitchShifter base class. This does the bulk of its work in the ShiftPitchfunction:
c#:
public void ShiftPitch(float[] inputBuff, float inputPitch,
                       float targetPitch, float[] outputBuff, int nFrames)
{
     UpdateSettings();
     detectedPitch = inputPitch;
VB.Net
Public Sub ShiftPitch(ByVal inputBuff() As Single, ByVal inputPitch As Single,
                    ByVal targetPitch As Single, ByVal outputBuff() As Single, ByVal nFrames As Integer)
    UpdateSettings()
    detectedPitch = inputPitch
The inputPitch parameter is set to the frequency detected by the PitchDetector. ThetargetPitch parameter is currently unused, but will be used to specify the target pitch in real-time when accepting input from, say, a MIDI keyboard. In any case, we callUpdateSettings in order to see if any of the autotune algorithm settings have changed since last time.
Next we calculate the amount we need to shift the pitch shift. A shift factor of 1 means no change. We don't allow the shift factor to go above 2 or below 0.5, since these figures represent a whole octave change:
c#:
float shiftFactor = 1.0f;

if (inputPitch > 0)
{
    shiftFactor = snapFactor(inputPitch);
}

if (shiftFactor > 2.0) shiftFactor = 2.0f;
if (shiftFactor < 0.5) shiftFactor = 0.5f;
VB.Net
Dim shiftFactor = 1.0F

If inputPitch > 0 Then
   shiftFactor = snapFactor(inputPitch)
   shiftFactor += addVibrato(nFrames)
End If

If shiftFactor > 2.0 Then
   shiftFactor = 2.0F
End If
If shiftFactor < 0.5 Then
   shiftFactor = 0.5F
End If
The decision of what the target note is takes place in the snapFactor function:
c#:
protected float snapFactor(float freq)
{
    float previousFrequency = 0.0f;
    float correctedFrequency = 0.0f;
    int previousNote = 0;
    int correctedNote = 0;
    for (int i = 1; i < 120; i++)
    {
        bool endLoop = false;
        foreach (int note in this.settings.AutoPitches)
        {
            if (i % 12 == note)
            {
                previousFrequency = correctedFrequency;
                previousNote = correctedNote;
                correctedFrequency = (float)(8.175 * Math.Pow(1.05946309, (float)i));
                correctedNote = i;
                if (correctedFrequency > freq) { endLoop = true; }
                break;
            }
        }
        if (endLoop)
        {
            break;
        }
    }
    if (correctedFrequency == 0.0) { return 1.0f; }
    int destinationNote = 0;
    double destinationFrequency = 0.0;
    // decide whether we are shifting up or down
    if (correctedFrequency - freq > freq - previousFrequency)
    {
        destinationNote = previousNote;
        destinationFrequency = previousFrequency;
    }
    else
    {
        destinationNote = correctedNote;
        destinationFrequency = correctedFrequency;
    }
    if (destinationNote != currPitch)
    {
        numElapsed = 0;
        currPitch = destinationNote;
    }
    if (attack > numElapsed)
    {
        double n = (destinationFrequency - freq) / attack * numElapsed;
        destinationFrequency = freq + n;
    }
    numElapsed++;
    return (float)(destinationFrequency / freq);
}
VB.Net:
Protected Function snapFactor(ByVal freq As Single) As Single
    Dim previousFrequency = 0.0F
    Dim correctedFrequency = 0.0F
    Dim previousNote = 0
    Dim correctedNote = 0
    For i = 1 To 119
        Dim endLoop = False
        For Each note As Integer In Me.settings.AutoPitches
            If i Mod 12 = note Then
                previousFrequency = correctedFrequency
                previousNote = correctedNote
                correctedFrequency = CSng(8.175 * Math.Pow(1.05946309, CSng(i)))
                correctedNote = i
                If correctedFrequency > freq Then
                    endLoop = True
                End If
                Exit For
            End If
        Next note
        If endLoop Then
            Exit For
        End If
    Next i
    If correctedFrequency = 0.0 Then
        Return 1.0f
    End If
    Dim destinationNote = 0
    Dim destinationFrequency = 0.0
    ' decide whether we are shifting up or down
    If correctedFrequency - freq > freq - previousFrequency Then
        destinationNote = previousNote
        destinationFrequency = previousFrequency
    Else
        destinationNote = correctedNote
        destinationFrequency = correctedFrequency
    End If
    If destinationNote <> currPitch Then
        numElapsed = 0
        currPitch = destinationNote
    End If
    If attack > numElapsed Then
        Dim n = (destinationFrequency - freq) / attack * numElapsed
        destinationFrequency = freq + n
    End If
    numElapsed += 1
    Return CSng(destinationFrequency / freq)
End Function
The way this function works is that it runs through the MIDI notes 0-120 and, if that note is selected as one of the valid pitches we support, we remember the “corrected frequency,” which can be calculated from the MIDI note number with the following formula:
c#:
correctedFrequency = (float)(8.175 * Math.Pow(1.05946309, (float)midiNoteNumber));
VB.Net
correctedFrequency = CSng(8.175 * Math.Pow(1.05946309, CSng(i)))
Obviously, a pitch is likely to fall somewhere in between two valid notes, so we choose the which pitch to correct by determining which one is closest to the detected frequency.
The snapFactor function is also responsible for implementing the attack time parameter. This allows the destinationFrequency to be slowly moved to the target note over the duration of the attack period. Having calculated our shift factor, we are now ready to pass our data on to the actual pitch-shifting algorithm:
c#:
int fftFrameSize = 2048;
int osamp = 8; // 32 is best quality
SmbPitchShift.smbPitchShift(shiftFactor, nFrames, fftFrameSize, osamp, this.sampleRate, inputBuff, outputBuff);
VB.Net
Dim fftFrameSize = 2048
Dim osamp = 8 ' 32 is best quality
SmbPitchShift.smbPitchShift(shiftFactor, nFrames, fftFrameSize, osamp, Me.sampleRate, inputBuff, outputBuff)
The final thing we do in the ShiftPitch function is keep a record of the pitch shifts we have made. These are stored in a queue (maximum of 5000 entries) and are very useful for diagnosing what is going on if you are not getting the results you wanted from the algorithm:
c#:
shiftedPitch = inputPitch * shiftFactor;
updateShifts(detectedPitch, shiftedPitch, this.currPitch);
VB.Net
shiftedPitch = inputPitch * shiftFactor
updateShifts(detectedPitch, shiftedPitch, Me.currPitch)

Performance

Performance, as you might expect in a managed application that has not been extensively optimized, was not good. Using my laptop, I could autotune one minute of audio in about 90 seconds. Obviously, that rules out real-time autotuning. I decided to profile the application to see if there were any quick ways I could improve things.
The profiling tools in Visual Studio revealed that 20% of the time was spent on pitch detection and 80% pitch shifting. Unfortunately, there were not too many options available for optimisation, since further investigation pointed to calls to Math.Sin taking the bulk of the time. Possibly creating lookup tables could save a bit more time.
Fortunately, we have another option for speeding things up. The pitch-shifting algorithm takes an “oversampling” parameter, which by default is set to 32, the highest value. However, we can trade off speed for quality. Setting it to 16 meant that I could autotune a minute of audio in 55ms (on my 2.4GHz Core2Duo laptop) – realtime but only just. Setting it to 8 reduced that down to 36s. The results still sounded reasonable, so I have left it set at 8 in the code.
An alternative way of speeding it up though would be to swap in a different pitch-shifting algorithm. You could start by trying out one I created as part of the Skype Voice Changer project previously featured on Coding4Fun, which is also able to operate in real-time (although I haven't done any quality comparisons).

Creating a Test GUI

Rather than starting from scratch, I decided to build upon .NET Voice Recorder, a WPF application I created for a previous Coding4Fun article. This takes advantage of theNAudio .NET audio library for audio recording and playback capabilities. The GUI has three screens. On the first is the input device used for recording. The second records a short voice clip. And the third allows you to edit a small portion of saved audio.
Here's a screenshot of the second screen showing a recording in progress:
image
And here's the screen that allows you to trim the recording, preview it, and save it as WAV:
image
As you can see, I have added a new button allowing access to the autotune effect settings. On this screen, you can select which notes are valid, and you can also adjust the “attack time” if you prefer to not go for the robotic effect. I've included a drop-down menu that automatically selects the appropriate notes from various keys.
image
When you click “Apply,” the autotune effect is applied (while you wait on a background thread) and then you are returned to the screen, allowing you to play back your recording and see how it sounds. If you'd like, you can then go back and change the autotune settings (or turn it off).

MVVM Light

The original VoiceRecorder application used a MVVM (model-view-viewmodel) architecture for binding data to each view. I have updated it to make use of Laurent Bugnion's excellentMVVM Light library. This removes the need for my own RelayCommand and ViewModelBase classes, and also enables me to replace my ViewManager with a more extensible framework using the event aggregator (“Messenger”) that is included with MVVM light. This allows me to quickly navigate from one view to another by sending out a message on the event aggregator:
c#:
private void NavigateToSaveView()
{
   Messenger.Default.Send(new NavigateMessage(SaveViewModel.ViewName, this.voiceRecorderState));
}
VB.Net
Private Sub NavigateToSaveView()
    Messenger.Default.Send(New NavigateMessage(SaveViewModel.ViewName, Me.voiceRecorderState))
End Sub

Getting the Best out of Autotune

Unfortunately, Autotune is an effect that doesn't always produce the desired result . Obviously, if you want great autotune, you're best off buying a commercial implementation, but here are a few tips for getting the most out of an autotune algorithm:
  • Get a good quality recording. Avoid background noise, hum, and too quiet or too loud (distorted) recordings. If you need to sing against a backing track, play it in headphones so the microphone doesn't pick it up.
  • Know what key you are singing in. This is where the test application won't help you out, since if you don't know what key you are singing in, you can hardly expect to be able to select the appropriate key from the list. You might even be singing in between two keys! If you have an in-tune musical instrument at hand, play a note to give yourself a starting pitch.
  • Choose your scale. The easiest option is to just go with a chromatic scale, which means all 12 notes are valid. However, you can try to make the autotune force you into a monotone. Pentatonic scales are useful for instant gratification. They have five notes in them, and so long as you stick to those five, almost anything you sing will fit in with a backing track in your chosen key.
  • Adjust the attack time. An attack time of zero is great for the robot effect. A longer attack time will smooth out transitions.
  • Why is it “warbling”? With this autotune algorithm, it is quite common to get a “warbling” effect. This is because it is either changing its mind about what note to pitch shift to (because the pitch detector is not providing stable pitch detection), or because the pitch detector couldn't determine a pitch at all, so your voice isn't being pitch shifted. You can play around with the release slider (and may need to modify the release algorithm) if you want to eliminate warbling.

Taking it further

.NET Voice Recorder is open source and hosted on CodePlex in a Mercurial repository. So what you waiting for? Make a fork and have a go at improving it:
  • Improve the pitch detector algorithm. I have already given some suggestions for how this can be done. One idea that might be worth experimenting with is making it “hold” the detected frequency for a short period until a strong, new frequency is detected.
  • Display the detected pitches. The autotune effect stores data of the detected pitches as well as the pitches it attempts to convert to. You might display this information to the user, perhaps underneath the waveform, so they can see what it is detecting. (It currently outputs this information with Debug.WriteLine, which is useful for debugging purposes).
  • Suggest an appropriate scale? Instead of leaving it to the user to select what key to snap the notes to, how about auto-selecting the detected notes?
  • Allow direct input of desired pitch. Instead of letting the autotune effect try to work out what note it should be targeting at any given time, it is far more effective to let the user input what note they would like to shift to. This could be done by entering the note using a MIDI keyboard in real-time, or by drawing the notes in, perhaps on a “piano-roll” control. Or, users could implement a simple, domain-specific language to specify the desired note. For example:
    • 0:00.0 C#
    • 0:01.5 E
    • 0:02.7 G#
  • Port it to Windows Phone 7. This would be quite a cool effect to have on your phone. Of course, you might need to optimize performance a bit more to save battery life. And you'll want to put your graphic designer hat on to give it a more beautiful look than it currently has.