Wave files have the ability to contain Cue Points, a.k.a. Markers, a.k.a. Sync Points. Cue points are locations in the audio data which can be used to create loops, or non-sequential playlists. The FMOD Event API (for the playback of events created with the FMOD Designer application) has a neat feature where it will trigger a callback function when playing Wave data and reaching a cue point. However, unlike the lower level FMOD Ex API, there is no way to programmatically add cue points to an event or a sound definition - the cue points must be present in the wave files before they are added to a Sound Bank in FMOD Designer.
For a recent project I needed to make use of these cue point callbacks in an FMOD Designer project. I was disappointed to realise that Pro Tools doesn’t have the option of embedding its markers in exported wave files as cue points. So I had a look around for software that does this, and it seems the list is pretty short. On the Windows side there is Sound Forge and on Mac there is Triumph. Both are large, full featured audio packages, and not cheap, and I didn’t feel it was much of a value proposition when I only wanted to use one tiny feature. So I decided just to write some code to do it myself.
I sometimes jump around between Mac & Windows on interactive audio projects, so I decided to write the code in plain, portable C so that I could compile and use it on both platforms. Given that I was going that far, I decided just to make the code completely platform agnostic, which raised a few challenges but made for an interesting side project for a couple of days! In this post I’ll talk through the internals of a wave file and the code I wrote to add cue points. The full code is available on GitHub here.
The Wave File Format
The Wave file format’s roots go back to the IFF format developed by Electronic Arts back in the Amiga days (anyone remember Deluxe Paint?). The Wave format is an implementation of the RIFF file structure, which is Microsoft’s version of IFF. The difference between the two is that in IFF files bytes are laid out in big-endian format (which was the native format of the Amiga’s CPU), whereas RIFF files are laid out in little-endian format (which was, and is, the format of PCs’ Intel CPUs). The notion of big-endian vs little-endian is central to the platform-agnostic nature of the code shown below, so if you are not clear on these concepts there are good explanations around the web and on Wikipedia.
In RIFF files, the data is stored in chunks. A chunk consists of 2 parts, a header section and a data section. The header specifies the type of data in the chunk, and the number of bytes in the data section. The data section can then contain any data in any format. If an application reading a RIFF file encounters a chunk type that it doesn’t recognise, it can ignore it by jumping over the data section and moving on to the next chunk.
Specifically, in a chunk header the Chunk ID consists of 4 ASCII characters and the data size is specified by a 4 byte (32bit) unsigned integer, in little-endian format. The data section can be a variable number of bytes, as specified in the header. Chunks in RIFF files must be 2-byte aligned, i.e. each chunk must start on an even numbered byte. Therefore if the number of bytes in the data section is odd, a padding byte must be appended to the end of the data. However, it is important to note that if there is a padding byte present, it is *not* included in the data size value in the chunk header.
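As a rough illustration of the header layout and the padding rule described above, here is a sketch that steps from one chunk to the next in a file that has been loaded into a memory buffer. The function names readLE32 and nextChunkOffset are made up for this example and do not appear in the full source.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit unsigned integer from 4 little-endian bytes. */
static uint32_t readLE32(const unsigned char *bytes)
{
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}

/* Given the offset of a chunk header within a buffer, return the offset of
   the chunk that follows it, or 0 if the chunk does not fit in the buffer.
   Layout: 4-byte ID, 4-byte data size, data, optional padding byte. */
static size_t nextChunkOffset(const unsigned char *buffer,
                              size_t bufferSize,
                              size_t chunkOffset)
{
    if (chunkOffset > bufferSize || bufferSize - chunkOffset < 8)
    {
        return 0; /* no room for a chunk header */
    }
    uint32_t dataSize = readLE32(buffer + chunkOffset + 4);
    if (dataSize > bufferSize - chunkOffset - 8)
    {
        return 0; /* data section runs past the end of the buffer */
    }
    size_t next = chunkOffset + 8 + dataSize;
    if (dataSize % 2 != 0)
    {
        next++; /* the padding byte is not counted in dataSize */
    }
    return next;
}
```

A reader that does not recognise a chunk ID can call something like nextChunkOffset() to jump straight over it.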

The ‘root’ chunk in a RIFF file has the chunk ID of “RIFF”, and the data size value equal to the file size minus the 8 bytes for the chunk header. The data section of the RIFF chunk starts with 4 bytes of ASCII describing the contents of the file, followed by a series of other chunks appropriate to that file type.
Wave files in particular begin with the ‘root’ RIFF Chunk, and the 4 bytes at the beginning of the root chunk’s data section are the characters “WAVE”. There are numerous further types of chunk that can be contained within a wave file, but at a minimum there should be a format chunk describing the format of the sample data, and a data chunk containing the actual audio samples.
The format chunk has the id “fmt ” (note the space as the 4th character). Its data section contains a compression code, indicating the compression used in the sample data. A value of 1 indicates uncompressed LPCM data. A list of common compression codes can be found here. The data section also includes the sample rate, bit depth, number of channels, average bytes per second and bytes per frame. Some of the more complex compressed audio formats may need additional information for decoding, so after the ‘standard’ values in the format chunk, there can be an optional variable-length section of further data. If the data size in the format chunk’s header is greater than 16 (the number of bytes required for the ‘standard’ format data), then there will be dataSize - 16 bytes of additional data at the end of the format chunk. If the number of extra bytes is odd, there will be a padding byte added to the end of the chunk.
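For uncompressed LPCM, the bytes per frame and average bytes per second values follow directly from the channel count, bit depth and sample rate. The sketch below shows the arithmetic. The function names are hypothetical, and it assumes each sample is stored in a whole number of bytes.

```c
#include <stdint.h>

/* Bytes per frame (block align) for uncompressed LPCM: one sample per
   channel, with each sample rounded up to a whole number of bytes. */
static uint16_t pcmBlockAlign(uint16_t channels, uint16_t bitsPerSample)
{
    return (uint16_t)(channels * ((bitsPerSample + 7) / 8));
}

/* Average bytes per second for uncompressed LPCM. */
static uint32_t pcmAverageBytesPerSecond(uint32_t sampleRate,
                                         uint16_t channels,
                                         uint16_t bitsPerSample)
{
    return sampleRate * pcmBlockAlign(channels, bitsPerSample);
}
```

For CD-quality audio (44,100 Hz, stereo, 16-bit) this gives 4 bytes per frame and 176,400 bytes per second.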

The data chunk contains the actual audio sample data. This chunk has the ID of “data”, and the chunk’s data section contains the samples in the format specified in the format chunk. Again, a padding byte will be added to the end of the data chunk if the length of the sample data is odd.

Cue Points are stored in a Cue chunk. The cue chunk has the ID “cue ” (again, note the trailing space). Its data section consists of a number indicating how many cues it contains followed by a series of Cue Point ‘sub chunks’. Each cue point has a unique identifier; a play order position that can be used with a separate Playlist chunk to create non-sequential playback options; a data chunk ID, which is the id of the chunk where the sample data for this cue point resides; a chunk start value which is used if the sample data is not in a standard “data” chunk; a block start value which indicates how many bytes of compressed data must be read before sample values can be decompressed; and a frame offset value which is the actual position of the cue point - if the audio is mono then 1 frame contains 1 sample; in stereo 1 frame contains 1 sample for the left channel and one for the right, and so on.

The code we will look at below shows adding cue points to LPCM wave files only (for clarity’s sake) so the values for each cue point are simple:
- Cue Point ID: the index of the cue point
- Play Order Position: 0 (i.e. no playlist)
- Data Chunk ID: “data”
- Chunk Start: 0 (we are using a standard ‘data’ chunk)
- Block Start: 0 (uncompressed sample data needs no pre-reading)
- Frame Offset: the position of our cue point.
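Marker positions are often noted as times rather than frame counts. Assuming a marker time in milliseconds and the sample rate from the format chunk, a frame offset could be derived along these lines. The function name is hypothetical and is not part of the full source.

```c
#include <stdint.h>

/* Convert a marker time in milliseconds to a frame offset. Frames are
   counted per sample period, so the channel count does not matter. */
static uint32_t frameOffsetFromMilliseconds(uint32_t milliseconds,
                                            uint32_t sampleRate)
{
    /* Use a 64-bit intermediate so the multiplication cannot overflow.
       The result is truncated to 32 bits, the width of the cue field. */
    return (uint32_t)(((uint64_t)milliseconds * sampleRate) / 1000u);
}
```

For example, a marker 1.5 seconds into a 48 kHz file lands on frame 72,000, whether the file is mono or stereo.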
There are numerous other types of chunk that can be found in a wave file, some standard, some standard-ish and some non-standard - as to be expected with a media format as old as this one. Some other chunks are useful, such as the Label and Note chunks which can associate text with cues, and others are just weird, such as the “JUNK” chunk I’ve seen in a few files (I think this comes from Pro Tools). But as we will do in the code below, programs can just ignore any chunks that they don’t recognise, which also means application specific metadata can be added to wave files without breaking compatibility.
Representing Chunks in Code
The code we will look at below will read a wave file, a file containing marker locations, and write out a new wave file containing cue points for each marker. As we go through the code to read and write wave files there will be some chunks that we will want to inspect and manipulate and others that we don’t care about. For simplicity and readable code, we start by defining structs to represent the chunks we will be manipulating:
typedef struct {
char chunkID[4]; // Must be "RIFF"
char dataSize[4];
char riffType[4]; // Must be "WAVE"
} WaveHeader;
typedef struct {
char chunkID[4]; // String: must be "fmt "
char chunkDataSize[4];
char compressionCode[2];
char numberOfChannels[2];
char sampleRate[4];
char averageBytesPerSecond[4];
char blockAlign[2];
char significantBitsPerSample[2];
} FormatChunk;
typedef struct {
char cuePointID[4];
char playOrderPosition[4];
char dataChunkID[4];
char chunkStart[4];
char blockStart[4];
char frameOffset[4];
} CuePoint;
typedef struct {
char chunkID[4]; // String: Must be "cue "
char chunkDataSize[4];
char cuePointsCount[4];
CuePoint *cuePoints; // In-memory pointer only; the cue points are written out separately
} CueChunk;
Chunks that we will not manipulate directly can just be copied over from the source wave file into the output file. Therefore for these ‘generic’ chunks, we can just keep a note of their location in the source file with a struct like this:
typedef struct {
long startOffset; // in bytes
long size; // in bytes
} ChunkLocation;
Endianness
As noted above, all the integer data in a Wave file is in little-endian format. Since Windows PCs and current Mac computers all use Intel-derived CPUs, they all store data in memory as little endian. However, as I was writing this code to be portable, I decided to go all the way, and not assume that the platform running the code would be little endian. You will see that all the fields in the struct definitions for chunks above are just char (i.e. byte) arrays. Whenever we read data into one of these structs from a file or write one of these structs to a file, the data must explicitly be in little endian format. When we want to actually manipulate the data from one of these structs we will explicitly create a host-CPU endian integer variable by way of a function that will convert little-endian bytes to a host endian integer if necessary. Therefore in the code you will see these four functions:
uint32_t littleEndianBytesToUInt32(char littleEndianBytes[4]);
void uint32ToLittleEndianBytes(uint32_t uInt32Value,
char out_LittleEndianBytes[4]);
uint16_t littleEndianBytesToUInt16(char littleEndianBytes[2]);
void uint16ToLittleEndianBytes(uint16_t uInt16Value,
char out_LittleEndianBytes[2]);
The implementation is just some simple byte-reordering, which I won’t go into here, but which is included in the full source code.
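For readers who want a picture of what that reordering involves, here is one way the four functions could be written. This is a sketch only, and the version in the full source may differ. It uses shifts rather than detecting the host's byte order, so the same code behaves identically on little-endian and big-endian CPUs.

```c
#include <stdint.h>

uint32_t littleEndianBytesToUInt32(char littleEndianBytes[4])
{
    /* Read through unsigned char so bytes >= 0x80 are not sign-extended */
    const unsigned char *b = (const unsigned char *)littleEndianBytes;
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

void uint32ToLittleEndianBytes(uint32_t uInt32Value,
                               char out_LittleEndianBytes[4])
{
    unsigned char *b = (unsigned char *)out_LittleEndianBytes;
    b[0] = (unsigned char)(uInt32Value & 0xFF);         /* least significant byte first */
    b[1] = (unsigned char)((uInt32Value >> 8) & 0xFF);
    b[2] = (unsigned char)((uInt32Value >> 16) & 0xFF);
    b[3] = (unsigned char)((uInt32Value >> 24) & 0xFF); /* most significant byte last */
}

uint16_t littleEndianBytesToUInt16(char littleEndianBytes[2])
{
    const unsigned char *b = (const unsigned char *)littleEndianBytes;
    return (uint16_t)(b[0] | (b[1] << 8));
}

void uint16ToLittleEndianBytes(uint16_t uInt16Value,
                               char out_LittleEndianBytes[2])
{
    unsigned char *b = (unsigned char *)out_LittleEndianBytes;
    b[0] = (unsigned char)(uInt16Value & 0xFF);
    b[1] = (unsigned char)((uInt16Value >> 8) & 0xFF);
}
```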
Reading Wave Files
Onto the main course: code. First up is opening our input wave file and checking it really is a wave file:
FILE *inputFile = fopen(inFilePath, "rb");
WaveHeader *waveHeader = (WaveHeader *)malloc(sizeof(WaveHeader));
fread(waveHeader, sizeof(WaveHeader), 1, inputFile);
if (strncmp(&(waveHeader->chunkID[0]), "RIFF", 4) != 0)
{
fprintf(stderr, "Input file is not a RIFF file\n");
goto CleanUpAndExit;
}
if (strncmp(&(waveHeader->riffType[0]), "WAVE", 4) != 0)
{
fprintf(stderr, "Input file is not a WAVE file\n");
goto CleanUpAndExit;
}
In the above, and throughout the rest of the code samples, I’ll skip over error checking for brevity.
Next we will read through the remainder of the input file and identify the different chunks that it contains, by reading the Chunk ID of the next chunk we encounter:
while (1)
{
char nextChunkID[4];
fread(&nextChunkID[0], sizeof(nextChunkID), 1, inputFile);
if (feof(inputFile))
{
break;
}
We are interested in format chunks, data chunks and any existing cue chunks. By convention format chunks should be the first chunk, but you never know. When we find a format chunk, we can check we have uncompressed audio:
if (strncmp(&nextChunkID[0], "fmt ", 4) == 0)
{
// Elsewhere declare FormatChunk *formatChunk
formatChunk = (FormatChunk *)malloc(sizeof(FormatChunk));
// Skip back to the start of the chunk
fseek(inputFile, -4, SEEK_CUR);
fread(formatChunk, sizeof(FormatChunk), 1, inputFile);
if (littleEndianBytesToUInt16(formatChunk->compressionCode) != 1)
{
fprintf(stderr, "Compressed audio formats are not supported\n");
goto CleanUpAndExit;
}
Although the extra format information at the end of a format chunk isn’t required for uncompressed audio, some applications may stick some data in there anyway. So we check for that and any padding byte that may come after it:
uint32_t extraFormatBytesCount =
littleEndianBytesToUInt32(formatChunk->chunkDataSize) - 16;
if (extraFormatBytesCount > 0)
{
formatChunkExtraBytes.startOffset = ftell(inputFile);
formatChunkExtraBytes.size = extraFormatBytesCount;
fseek(inputFile, extraFormatBytesCount, SEEK_CUR);
if (extraFormatBytesCount % 2 != 0)
{
fseek(inputFile, 1, SEEK_CUR);
}
}
}
When we find a data chunk, we just keep track of its position:
else if (strncmp(&nextChunkID[0], "data", 4) == 0)
{
// Declare elsewhere ChunkLocation dataChunkLocation
dataChunkLocation.startOffset = ftell(inputFile) - sizeof(nextChunkID);
char sampleDataSizeBytes[4];
fread(sampleDataSizeBytes, sizeof(char), 4, inputFile);
uint32_t sampleDataSize = littleEndianBytesToUInt32(sampleDataSizeBytes);
dataChunkLocation.size = sizeof(nextChunkID)
+ sizeof(sampleDataSizeBytes)
+ sampleDataSize;
// Skip to the end of the chunk.
fseek(inputFile, sampleDataSize, SEEK_CUR);
if (sampleDataSize % 2 != 0)
{
fseek(inputFile, 1, SEEK_CUR);
}
}
If we find an existing cue chunk, i.e.:
else if (strncmp(&nextChunkID[0], "cue ", 4) == 0)
{
...
}
We could go ahead and stick its details into a CueChunk struct if we want to try to merge existing cue points with the ones we will add, but for now we will just ignore it. For any other chunks that we find we will just keep a note of their location in the input file.
else
{
// Declared elsewhere:
// const int maxOtherChunks = 256 or whatever;
// int otherChunksCount = 0;
// ChunkLocation otherChunkLocations[maxOtherChunks] = {{0}};
otherChunkLocations[otherChunksCount].startOffset = ftell(inputFile)
- sizeof(nextChunkID);
char chunkDataSizeBytes[4] = {0};
fread(chunkDataSizeBytes, sizeof(char), 4, inputFile);
uint32_t chunkDataSize = littleEndianBytesToUInt32(chunkDataSizeBytes);
otherChunkLocations[otherChunksCount].size = sizeof(nextChunkID)
+ sizeof(chunkDataSizeBytes)
+ chunkDataSize;
// Skip over the chunk's data, and any padding byte
fseek(inputFile, chunkDataSize, SEEK_CUR);
if (chunkDataSize % 2 != 0)
{
fseek(inputFile, 1, SEEK_CUR);
}
otherChunksCount++;
}
Assuming our input file contained at least a format chunk and a data chunk, we can go ahead and read in the cue point locations. For this, I assumed that the locations would be presented in a plain text file with one location per line (as this is reasonably easy to generate from Pro Tools’ “Export Session Info” feature). However, portability becomes an issue here again, with different platforms using different end-of-line characters in plain text files. OS X and Unix use the newline character ‘\n’, whereas Windows uses a carriage return character followed by a newline character ‘\r\n’. For completeness’ sake we will also accommodate the classic Mac’s case, which uses a single carriage return character ‘\r’.
To read in the cue point locations we can step through each character in the file, adding any numeric characters to a string. When we hit the end of the line, we convert the string to an integer and start over:
FILE *markersFile = fopen(markerFilePath, "rb");
uint32_t cueLocations[1000] = {0};
int cueLocationsCount = 0;
while (!feof(markersFile))
{
char cueLocationString[11] = {0};
// Max Value for a 32 bit int is 4,294,967,295,
// i.e. 10 numeric digits, so char[11] should be
// enough storage for all the digits in a line
// plus a terminator (\0).
int charIndex = 0;
// Loop through each line in the markers file
while (1)
{
// fgetc returns an int so that EOF can be told apart from valid bytes
int nextChar = fgetc(markersFile);
if (feof(markersFile))
{
cueLocationString[charIndex] = '\0';
break;
}
// check for end of line
if (nextChar == '\r')
{
// This is a Classic Mac line ending '\r'
// or the start of a Windows line ending '\r\n'
// If this is the start of a '\r\n', gobble up
//the '\n' too
int peekAheadChar = fgetc(markersFile); // int, so the EOF comparison below is reliable
if ((peekAheadChar != EOF) && (peekAheadChar != '\n'))
{
ungetc(peekAheadChar, markersFile);
}
cueLocationString[charIndex] = '\0';
break;
}
if (nextChar == '\n')
{
// This is a Unix/ OS X line ending '\n'
cueLocationString[charIndex] = '\0';
break;
}
if ( (nextChar == '0') || (nextChar == '1') ||(nextChar == '2')
||(nextChar == '3') ||(nextChar == '4') ||(nextChar == '5')
||(nextChar == '6') ||(nextChar == '7') ||(nextChar == '8')
||(nextChar == '9'))
{
// This is a regular numeric character, if there are less than
// 10 digits in the cueLocationString, add this character.
// More than 10 digits is too much for a 32bit unsigned
// integer, so ignore this character and spin through the loop
// until we hit EOL or EOF
if (charIndex < 10)
{
cueLocationString[charIndex] = nextChar;
charIndex++;
}
}
}
// Convert the digits from the line to a uint32 and add to cueLocations
if (strlen(cueLocationString) > 0)
{
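// Use strtoul rather than strtol: a long may be only 32 bits wide
// (e.g. on Windows), which cannot hold the upper half of the uint32 range
unsigned long cueLocation_ULong = strtoul(cueLocationString, NULL, 10);
if (cueLocation_ULong < UINT32_MAX)
{
cueLocations[cueLocationsCount] = (uint32_t)cueLocation_ULong;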
cueLocationsCount++;
}
}
}
We then create cue point structs for each location and add them to a cue chunk.
// Create CuePointStructs for each cue location
CuePoint *cuePoints = malloc(sizeof(CuePoint) * cueLocationsCount);
for (uint32_t i = 0; i < cueLocationsCount; i++)
{
uint32ToLittleEndianBytes(i + 1, cuePoints[i].cuePointID);
uint32ToLittleEndianBytes(0, cuePoints[i].playOrderPosition);
cuePoints[i].dataChunkID[0] = 'd';
cuePoints[i].dataChunkID[1] = 'a';
cuePoints[i].dataChunkID[2] = 't';
cuePoints[i].dataChunkID[3] = 'a';
uint32ToLittleEndianBytes(0, cuePoints[i].chunkStart);
uint32ToLittleEndianBytes(0, cuePoints[i].blockStart);
uint32ToLittleEndianBytes(cueLocations[i], cuePoints[i].frameOffset);
}
// Populate a CueChunk Struct
CueChunk cueChunk;
cueChunk.chunkID[0] = 'c';
cueChunk.chunkID[1] = 'u';
cueChunk.chunkID[2] = 'e';
cueChunk.chunkID[3] = ' ';
uint32ToLittleEndianBytes(4 + (sizeof(CuePoint) * cueLocationsCount),
cueChunk.chunkDataSize);
uint32ToLittleEndianBytes(cueLocationsCount,
cueChunk.cuePointsCount);
cueChunk.cuePoints = cuePoints;
We are now ready to write out the new wave file. As the length of the file is specified in the root RIFF chunk’s header, we need to calculate the length up front:
FILE *outputFile = fopen(outFilePath, "wb");
// Update the file header chunk to have the new data size
uint32_t fileDataSize = 0;
fileDataSize += 4; // the 4 bytes for the Riff Type "WAVE"
fileDataSize += sizeof(FormatChunk);
fileDataSize += formatChunkExtraBytes.size;
if (formatChunkExtraBytes.size % 2 != 0)
{
fileDataSize++; // Padding byte for 2byte alignment
}
fileDataSize += dataChunkLocation.size;
if (dataChunkLocation.size % 2 != 0)
{
fileDataSize++;
}
for (int i = 0; i < otherChunksCount; i++)
{
fileDataSize += otherChunkLocations[i].size;
if (otherChunkLocations[i].size % 2 != 0)
{
fileDataSize ++;
}
}
fileDataSize += 4; // 4 bytes for CueChunk ID "cue "
fileDataSize += 4; // UInt32 for CueChunk.chunkDataSize
fileDataSize += 4; // UInt32 for CueChunk.cuePointsCount
fileDataSize += (sizeof(CuePoint) * cueLocationsCount);
uint32ToLittleEndianBytes(fileDataSize, waveHeader->dataSize);
// Write out the header to the new file
fwrite(waveHeader, sizeof(*waveHeader), 1, outputFile);
We next write all our chunks out to the output file. To keep the code simple and readable, we use a little helper function to copy chunks we haven’t modified from the input file to the output file:
int writeChunkLocationFromInputFileToOutputFile(ChunkLocation chunk,
FILE *inputFile,
FILE *outputFile);
This function just copies over chunk data from the input file in 1MB pieces. You can see the implementation in the full source code listing. Although chunks can appear in any order in a wave file, it makes sense to put all the metadata chunks first and the data chunk at the end - as this means a program can get all the information it needs to start playback before the whole file is loaded - useful if the file is large and being streamed over a network. So we next write out the format chunk, the cue chunk and any other chunks we came across in the input file, and finally the data chunk.
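To give an idea of what such a helper involves, here is a sketch of how it could be written. It is not the version from the full source, and the return convention (0 for success, -1 for failure) is an assumption made for this example.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    long startOffset; /* in bytes */
    long size;        /* in bytes */
} ChunkLocation;

/* Copy chunk.size bytes, starting at chunk.startOffset in inputFile, to the
   current position in outputFile. Returns 0 on success, -1 on failure. */
int writeChunkLocationFromInputFileToOutputFile(ChunkLocation chunk,
                                                FILE *inputFile,
                                                FILE *outputFile)
{
    const size_t maxPieceSize = 1024 * 1024; /* copy in pieces of up to 1MB */
    char *buffer = malloc(maxPieceSize);
    if (buffer == NULL)
    {
        return -1;
    }
    if (fseek(inputFile, chunk.startOffset, SEEK_SET) != 0)
    {
        free(buffer);
        return -1;
    }
    long remaining = chunk.size;
    int result = 0;
    while (remaining > 0)
    {
        size_t pieceSize = (remaining < (long)maxPieceSize)
                         ? (size_t)remaining
                         : maxPieceSize;
        if (fread(buffer, 1, pieceSize, inputFile) != pieceSize
            || fwrite(buffer, 1, pieceSize, outputFile) != pieceSize)
        {
            result = -1; /* short read or short write */
            break;
        }
        remaining -= (long)pieceSize;
    }
    free(buffer);
    return result;
}
```

Copying in fixed-size pieces keeps memory use bounded however large the data chunk is.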
// Write out the format chunk
fwrite(formatChunk, sizeof(FormatChunk), 1, outputFile);
if (formatChunkExtraBytes.size > 0)
{
writeChunkLocationFromInputFileToOutputFile(formatChunkExtraBytes,
inputFile,
outputFile);
if (formatChunkExtraBytes.size % 2 != 0)
{
fwrite("\0", sizeof(char), 1, outputFile);
}
}
// Write out the start of new Cue Chunk: chunkID, dataSize and cuePointsCount
size_t writeSize = sizeof(cueChunk.chunkID)
+ sizeof(cueChunk.chunkDataSize)
+ sizeof(cueChunk.cuePointsCount);
fwrite(&cueChunk, writeSize, 1, outputFile);
// Write out the Cue Points
uint32_t cuePointsCount =
littleEndianBytesToUInt32(cueChunk.cuePointsCount);
for (uint32_t i = 0; i < cuePointsCount; i++)
{
fwrite(&(cuePoints[i]), sizeof(CuePoint), 1, outputFile);
}
// Write out the other chunks from the input file
for (int i = 0; i < otherChunksCount; i++)
{
writeChunkLocationFromInputFileToOutputFile(otherChunkLocations[i],
inputFile,
outputFile);
if (otherChunkLocations[i].size % 2 != 0)
{
fwrite("\0", sizeof(char), 1, outputFile);
}
}
// Write out the data chunk
writeChunkLocationFromInputFileToOutputFile(dataChunkLocation,
inputFile,
outputFile);
if (dataChunkLocation.size % 2 != 0)
{
fwrite("\0", sizeof(char), 1, outputFile);
}
And that’s it, we now have a wave file with embedded cue points that we can use for FMOD or any other such application. The full source code is available here (public domain). You could wrap it up inside a GUI, use it in an audio app, or just compile it and run it from the command line.
One point about sample data that may cause some confusion is that when samples are represented with 8-bits, they are specified as unsigned values. All other sample bit-sizes are specified as signed values. For example a 16-bit sample can range from -32,768 to +32,767 with a mid-point (silence) at 0.
- Good tip about the Wave file format from http://www.sonicspot.com/guide/wavefiles.html
(Source: sonicspot.com)
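To make the difference concrete, here is a small sketch of converting an unsigned 8-bit sample to a signed 16-bit one. It assumes the usual convention that silence in 8-bit data sits at 128. The function name is hypothetical.

```c
#include <stdint.h>

/* 8-bit WAVE samples are unsigned, with silence assumed at 128.
   16-bit samples are signed, with silence at 0. */
static int16_t sample8ToSample16(uint8_t sample8)
{
    /* Re-centre around zero, then scale up to the 16-bit range */
    return (int16_t)(((int)sample8 - 128) * 256);
}
```

With this mapping an 8-bit value of 128 becomes 0, and 0 becomes -32,768.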
Hi Jim, Sorry to sound like a n00b but I really need some clarity with regard to fmod. I have been learning fmod with the help of online resources. I would like to know what happens when I handover the .fsb files to the programmer? What if the programmer is using Unity? Will he still be able to integrate the fmod build into his project? So when I build a project on fmod, is the build universal to all programming languages?
Hi,
When you build your project in FMOD Designer you will output at least 2 files: an FSB file, which contains all the audio data; and an FEV file which contains the definitions and parameters of your events, effects, etc. The game must have the FMOD Ex audio engine in order to read and ‘play back’ the content of the FEV & FSB files. The audio programmer or integration programmer will write code using the FMOD Ex API (Application Programming Interface) that reads your files and triggers the events, effects & properties, etc at the appropriate points during gameplay. When you build your project in FMOD Designer you can optionally produce a notes file and a programmer’s header file, which will make it easier for the programmer to understand the content of your design. (But no substitute for face-to-face discussions!)
Therefore, the game engine/development platform used for the game will need to either include FMOD engine integration, or have the ability to add the FMOD Ex libraries to the code. Last time I worked with Unity, it did not have native support for FMOD integration. If the game is being built with this platform then the programmer will have to manually integrate the FMOD libraries.
The FMOD Ex API has C and C++ interfaces. There is also a C# wrapper available to integrate FMOD into managed code projects on Windows. There may be other 3rd party wrappers available for further programming languages.
Hope this helps,
Jim
This is my first Kinect based project. It is essentially a ‘matrix sequencer’, inspired by Yamaha’s Tenori-On. The shape and position of the player’s body is mapped onto a grid of lights. A set of sounds or instrument samples can be loaded into each row in the grid, and each column represents a musical beat in a bar. As the metronome cycles through each beat, any lights that are turned on in that beat’s column trigger the playback of the sound associated with the light’s row.
The video shows it in action.
Up to six different people can be tracked by the Kinect and mapped onto the grid, and any wav file can be loaded into a channel. As you can see in the video, the grid can be orientated with either the sounds in rows and beats in columns, or vice-versa. Depending on the sounds used and the number of people, the different orientations can produce better sounding music. If you would like to give it a try, there is a download link at the end of this article.
The game uses the Microsoft Kinect for Windows SDK to get player tracking data from a Kinect Sensor. This data is used to map the player shapes onto the Grid UI, which is put together with WPF. Sound playback is done with the FMOD Ex API, and the code is written in C#. The rest of this article describes how this is all put together.
The Kinect
The Kinect sensor, when paired with the Kinect for Windows SDK, provides three streams of visual/spatial data that an app/game can make use of: Color Stream, which is the regular video stream; Depth Stream, which provides a grayscale image stream where each pixel’s value represents the distance of the object at that pixel from the sensor; and Skeleton Stream, which provides joint coordinates in three dimensions for up to two players. When both the depth and skeleton streams are enabled, there are three bits at the end of each depth pixel value that indicate whether the pixel is part of a player (person) and the sensor can identify and differentiate up to six people simultaneously. It is this player identification that Kinectori-On makes use of.
I found the Kinect SDK surprisingly straightforward. When the game has detected a Kinect sensor, (which I won’t cover here, as setup is covered well in the Microsoft documentation), enabling the two streams needed for player identification is done with two simple calls:
Both the Color and Depth streams’ Enable() methods take a format parameter, where the frame size and rate for the stream can be set. This allows for trade offs between the workload of processing the stream data and having high quality imagery. In this case, the player shapes are being mapped onto what is effectively a 16 x 16 pixel image (the grid), so here I’ve stuck with the lowest resolution available for the Depth stream: 320 x 240.
A point to note here is that the code above enables both the Depth stream and the Skeleton stream, but only registers for ‘frame ready’ callbacks for the Depth stream. This is because the Skeleton stream needs to be enabled for the Depth image pixel data to include the bits that indicate player identification. However, the actual skeleton data (player and joint positions in 3D) is not required for this project, so it is just ignored. The frame ready callbacks are called when the sensor can provide the next full frame of data for the stream. In the case of the Color and Depth streams, this means receiving a single image representing the frame.
The mapping of players onto the grid is done in a method called from within the Depth Frame Ready callback method:
In this method the Depth frame image is effectively stretched to the aspect ratio of the grid in the UI (referred to as the ControlGrid in the code). The image is then divided into rows and columns to match the grid, in this case 16 x 16. In each subdivided piece of the image the pixels are examined to determine if they represent a player, which is done by bitwise ANDing the pixel value with a constant provided by the SDK. The result will be between 0 and 6, where 0 indicates the pixel is not part of a player, and the other values identify to which player the pixel belongs. If any one player represents more than 25% of the pixels in a grid square, the corresponding ‘light’ in the UI is turned on with a colour representing that player.
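The project itself is written in C#, but the per-square logic can be sketched in plain C. The sketch below assumes 16-bit depth pixels whose low three bits carry the player index, and every name in it is hypothetical. The actual method in the project may be organised differently.

```c
#include <stdint.h>

#define PLAYER_INDEX_MASK 0x7 /* assumed: low 3 bits hold the player index */
#define MAX_PLAYERS 6

/* Return the player (1..6) covering more than 25% of the pixels in one grid
   square, or 0 if there is none. If more than one player qualifies, the one
   with the most pixels wins. */
static int playerForGridSquare(const uint16_t *depthPixels,
                               int frameWidth, int frameHeight,
                               int gridColumns, int gridRows,
                               int column, int row)
{
    int counts[MAX_PLAYERS + 1] = {0};
    /* Pixel bounds of this square: the frame is split proportionally */
    int x0 = column * frameWidth / gridColumns;
    int x1 = (column + 1) * frameWidth / gridColumns;
    int y0 = row * frameHeight / gridRows;
    int y1 = (row + 1) * frameHeight / gridRows;
    int total = (x1 - x0) * (y1 - y0);

    for (int y = y0; y < y1; y++)
    {
        for (int x = x0; x < x1; x++)
        {
            int player = depthPixels[y * frameWidth + x] & PLAYER_INDEX_MASK;
            if (player <= MAX_PLAYERS)
            {
                counts[player]++;
            }
        }
    }

    int best = 0;
    for (int p = 1; p <= MAX_PLAYERS; p++)
    {
        /* counts[p] * 4 > total means "more than 25% of the square" */
        if (counts[p] * 4 > total && (best == 0 || counts[p] > counts[best]))
        {
            best = p;
        }
    }
    return best;
}
```

Calling this once per grid square for each depth frame gives the player, if any, whose colour should light that square.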
Playing the Audio with FMOD
I looked into a few options for getting sound playback for this project, as this is the first real C#/WPF/.NET project I’ve worked on. Initially I tried .NET’s own System.Media.SoundPlayer class, but it just isn’t robust enough for a sequencer type project. Whatever system I used had to work in managed code, therefore DShow, which I’d used previously, and Direct X were out. Managed Direct X looked really promising until I realized it had been deprecated in favour of XNA. The time I had to finish this project was limited, so I didn’t want to start learning XNA. In the end I went with FMOD which is a C/C++ library, but has a managed C# wrapper, and I was already pretty familiar with its API.
Getting FMOD linked with a managed C# project is a little different than with other platforms. The C# wrapper is made up of the files fmod.cs, fmod_dsp.cs, fmod_memoryinfo.cs and fmod_errors.cs (which are included with the Windows install of the API), and the first step is to get these into the project with Visual Studio’s Project > Add Existing Item… menu item. There is no need to explicitly point the linker to the fmod.dll library. Instead, it is just added to the project with Project > Add Existing Item… To have it copied to the executable’s directory, select fmod.dll in the Solution Explorer, then in the properties window, set the Copy to Output Directory property to Copy if newer.
The music playback is implemented with a single MusicPlayer class which has 5 public methods: setWavFilePathForSampleIndex(), clearWavFilePathForSampleIndex(), Start(), Stop() and Reset(). The first 2 methods are called when the user selects audio files to use or removes audio files in the UI. The remaining methods, obviously, start, stop and reset the music playback.
When the MusicPlayer is started, a DispatcherTimer instance is set up to fire at the rate specified by the BPM field in the UI. When the timer fires the MusicPlayer queries the Window Controller for the state of each cell in the main grid UI for the column for the current beat. For each active cell, the music player begins playback of the corresponding sound.
The first time a sound is played, a new FMOD.Sound instance is created with the relevant sound file. To minimize disk access and memory allocation, the instances are kept around after they have finished playing, in order to be reused when possible. Therefore, when sound playback is required, the MusicPlayer first looks in its store for an inactive (i.e. not currently playing) FMOD.Sound instance for the required sound. If it finds one it is re-played, otherwise a new FMOD.Sound instance is created. The path for the sound file is stored in the FMOD.Sound’s userData property, and when the playback of the sound is completed this path is used to check if the sound is still valid, i.e. the user hasn’t loaded another sound into that slot in the UI. If it is still valid, the FMOD.Sound instance is stored for re-use.
This was an interesting little project to work on, and actually playing with the Kinectori-On is quite fun! If you would like to try it out, you can download the full source code in a Visual Studio project from GitHub at https://github.com/jimmcgowan/Kinectori-On
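The timing arithmetic behind this is simple. Assuming the timer fires once per beat and each column is one beat, it might look like the sketch below (in C rather than the project's C#, with hypothetical function names).

```c
#include <stdint.h>

/* Milliseconds between timer ticks for a given tempo, assuming one tick
   per beat. Returns 0 for an invalid tempo of 0 BPM. */
static uint32_t beatIntervalMilliseconds(uint32_t beatsPerMinute)
{
    if (beatsPerMinute == 0)
    {
        return 0;
    }
    return 60000u / beatsPerMinute;
}

/* Advance to the next column, wrapping around at the end of the bar. */
static int nextBeatColumn(int currentColumn, int columnsPerBar)
{
    return (currentColumn + 1) % columnsPerBar;
}
```

At 120 BPM the timer would fire every 500 ms, and on a 16-column grid the column after 15 is 0 again.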
I’ve been pretty fascinated with the Kinect since it came out, and despite all the grumblings from a lot of gamers, I’ve enjoyed jumping around my living room making a fool of myself with it. So when the Kinect For Windows and the official Kinect SDK came out, I got really excited about the possibilities of using Kinect in some projects. I’m hoping to develop a performance piece at the HKAPA where myself and some lighting, sound and video designers can collaborate with some dancers and choreographers, and make use of the Kinect and some other game technologies. I’m really interested in developing a system that can track dancers in real time, analyse and understand their movements, and control various lighting, sound, and/or video design elements in a way that removes the need for the traditional linear sequence of cues and events that theatrical performances are usually slaved to. This would give the performers freedom to improvise, reorganize and otherwise alter their performance as they are performing, and the design elements would seamlessly follow them (no frantic stage managers and operators trying to keep up, and no ‘lowest common denominator’ designs to ‘simplify’ the operational process).
But first things first: getting into the Kinect hardware and SDK to see what it is capable of.

Day 1 with the Kinect SDK
It took a bit of wrangling to get a Kinect For Windows sensor, as there is no distributor in Hong Kong, but I’ve now got my hands on one and a couple of books to get me going. I’ve started working through Beginning Kinect Programming with the Microsoft Kinect SDK by Jarrett Webb & James Ashley. I’m only a few chapters in, but so far it is a really good introduction to the SDK (which is surprisingly straightforward), and it has some good example projects which build up a neat library of re-usable code chunks for ‘real’ projects. The Kinect SDK is available in C++ and C# flavours, and the Webb & Ashley book uses C#. Since I haven’t done any C# work beyond a few Unity scripts, I’m also skimming through Programming C# by Jesse Liberty, from O’Reilly.
I also grabbed a copy of Making Things See: 3D vision with Kinect Processing, Arduino, and MakerBot by Greg Borenstein. This book deals with the unofficial Kinect libraries, OpenNI and OpenKinect. I’ve only briefly flipped through it so far, but it looks like it has some cool projects.
Hopefully I’ll have something to show from all this soon!
Recently it was the Chinese Lunar New Year, which means a brief spell of non-teaching time for me. So to keep myself amused, I created this little app. It shows a vinyl/rubber texture and when you drag your fingers across it it makes fingers-dragged-across-rubber sounds! The audio is implemented with the FMOD Ex & Event libraries for iOS. I made this partly for my own curiosity and partly as a potential teaching resource.
The video was captured from the iPad simulator, as I’m not set up to record from an actual iPad, so the mouse cursor here represents a single finger. If you run the app on an iPhone or iPad you can drag multiple fingers around, each of which will create its own sound.
I started out by recording some raw material. I got some balloons, and squished, squeezed and stretched them at different levels of inflation. I’ve put the original recordings on SoundCloud in case anyone else has any use for them (CC license).

Examining the recordings, the sound made by dragging a finger across a rubber balloon is actually a series of short, snapping sounds - almost like little gunshots. Presumably as the finger drags across the rubber, the rubber surface stretches with the friction, then gets released and snaps back into place. The app works by selecting snap sounds according to the speed of the finger movement and concatenating them for the duration of the movement.

I separated out a whole bunch of individual ‘snaps’ and sorted them into two groups - big snaps and little snaps. In FMOD Designer I created 2 corresponding sound definitions and added the individual snaps to these definitions. The definitions are set to randomly choose their snaps on each spawn. These 2 sound definitions were added to an event, which has a single parameter called speed - the speed at which a finger is being dragged across the ‘rubber’ surface in the app. The speed parameter has a range of 0-100 pixels-per-second. The little snaps are used for 0-25, and the big snaps for the remainder. I added a little volume and pitch enveloping for extra interestingness.
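In the project this mapping lives in the FMOD Designer event rather than in code, but the same rule can be written out as a small C sketch to make the ranges explicit. The names SnapGroup, clamp_speed and snap_group_for_speed are hypothetical, and treating a speed of exactly 25 as a little snap is an assumption.

```c
/* The two groups of snap recordings described above. */
typedef enum { SNAP_LITTLE, SNAP_BIG } SnapGroup;

/* Keep a speed value inside the event parameter's 0-100 range. */
double clamp_speed(double speed)
{
    if (speed < 0.0)   return 0.0;
    if (speed > 100.0) return 100.0;
    return speed;
}

/* Little snaps cover 0-25, big snaps cover the rest of the range.
   Assumption: the boundary value 25 belongs to the little snaps. */
SnapGroup snap_group_for_speed(double speed)
{
    return clamp_speed(speed) <= 25.0 ? SNAP_LITTLE : SNAP_BIG;
}
```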

For the actual audio implementation, each finger drag event in the app will cause a series of concatenated snap sounds. The snaps for a single finger should not overlap with each other (though multiple fingers’ snaps can of course overlap). Therefore we want the FMOD event and its two sound instances set to oneshot - this will cause the event to automatically stop after it has played a single snap. As we’ll get into shortly, the app code will keep track of FMOD event playback and when a playback ends, if the finger is still moving, the speed parameter will be updated and a new event playback started. This causes the actual snap sound to be re-randomized on each playback. Setting the FMOD event to looping doesn’t re-randomize the snap sound on each iteration of the loop.
On to the app itself. The app is a simple single view application, for which there is an Xcode template. The app’s UI consists of a single UIImageView covering the whole of the root view, showing a vinyl/rubbery texture (the image is from Mayang’s Free Textures).
Adding the FMOD iOS libraries to an Xcode project is a different process from adding the Mac OS X version of the libraries (which was discussed in detail in my previous post). The OS X libraries are dynamic libraries, but Apple does not allow dynamic libraries in iOS apps, so the iOS versions are static libraries. The difference is that dynamic libraries are linked to your app’s executable at runtime, whereas static libraries are linked when the app is built.
iOS projects are complicated by the fact that the iOS hardware (e.g. an iPad) and the iOS simulator (i.e. your Mac) have different processors, meaning code compiled for one will not run on the other. Therefore the FMOD static libraries come in two flavours - iPhone and Simulator. Presuming you want to test code in the simulator and a device, you need to use both and link only the appropriate one when building for each target.
To do this, rather than adding the libraries to the project in the usual way, it is necessary to create conditional linker flags, conditional on the architecture being built for. To add the FMOD libraries in this way select your project in the source tree, click on the Build Settings tab and find the setting ‘Other Linker Flags’ in the ‘Linker’ section. Expand this section with the disclosure triangle if necessary. Hover the mouse over the ‘Debug’ entry and click the ‘+’ button that appears. A new row is added. From the popup on the left (which defaults to Any Architecture|Any SDK) select ‘Any iOS Simulator’.

Double click on the empty field on the right of the popup and you will get a pop-over window to enter the flags for the linker. Simply drag the simulator version of the FMOD Ex library, libfmodex_iphonesimulator.a, from the Finder into this window.

If you need to use the FMOD Event system for playback of FMOD Designer projects, like I did in this project, also drag the simulator version of the event library, libfmodevent_iphonesimulator.a, into this window. To add the iPhone hardware versions of the libraries, create another new row under ‘Debug’, set the popup on the left to ‘Any iOS’ and add the iPhone versions of the libraries, libfmodex_iphoneos.a and libfmodevent_iphoneos.a, in the same way. Create the same two conditionals for the ‘Release’ build, and add the same libraries there too. Finally add the FMOD headers to your project by simply dragging them into your source tree, or setting your ‘Header Search Paths’ build setting to include wherever you have the headers on your system.
The FMOD libraries depend on the AudioToolbox and CoreAudio frameworks, so don’t forget to add these to your project in the usual way.
Finally, the FMOD static libraries have some C++ code baked in them, so you will need to either add the text “-x objective-c++” to the ‘Other C Flags’ build setting, or change the extension of any .m files that make use of FMOD to .mm, which indicates Objective-C++ to the compiler.
In the rubber squeak app, all the code to run the audio (not that there is much) is in the View Controller class. This object holds an instance variable for the FMOD Event System, and one for the FMOD Event Group that contains the Squeak/Snap event. These get set up and initialized in the -initWithNibName:bundle: method:
To keep track of touches, the View Controller also has 2 dictionary instance variables, one that associates playback of FMOD events with a particular finger, and one that stores the timestamps of each touch’s last position update to calculate the movement speed. The bulk of the logic is done in the View Controller’s -touchesMoved:withEvent: method, which is called by the system each time a finger moves across the screen.
In this code, for each touch (i.e. finger) that has moved we check to see if there is an FMOD Event already playing. If so, the touch is ignored for this update. As mentioned above, the individual snap sounds that make up the finger dragging sound should not overlap. As these events are set to oneshot, they will automatically stop themselves after their snap sound has completed, so the (eventState == FMOD_EVENT_STATE_PLAYING) check will then evaluate to false.
If there is no snap currently playing, the speed of the finger movement is calculated from the difference between the timestamps of the current and previous update and the distance between the current and previous touch locations. A new instance of the FMOD Event is created, its speed parameter set and playback started.
At no point do I bother releasing any of the FMOD Event instances. In the FMOD Designer project, the event is set to allow for a maximum of 10 simultaneous instances (one for each finger) and when the system’s pool of event instances is empty, it is set to steal the oldest instance. Therefore the instances are recycled by the system. At most 10 instances can exist in memory when the app is idle. These would get recycled when the user begins dragging again, and will all get released when the system is destroyed in the View Controller’s dealloc method.
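FMOD handles this recycling internally, so none of it appears in the app code. Purely as an illustration of the "steal the oldest" policy, here is a toy model in C; the Instance struct, the acquire_instance function and the counter-based timestamps are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one pooled event instance. */
typedef struct {
    int in_use;                /* non-zero once the slot has been handed out */
    unsigned long started_at;  /* when the slot was last handed out */
} Instance;

/* Hand out a free slot if there is one, otherwise steal the slot that
   was started longest ago. Returns the index of the chosen slot. */
size_t acquire_instance(Instance *pool, size_t count, unsigned long now)
{
    assert(count > 0);
    size_t chosen = 0;
    for (size_t i = 0; i < count; i++) {
        if (!pool[i].in_use) {
            chosen = i;  /* a free slot always wins */
            break;
        }
        if (pool[i].started_at < pool[chosen].started_at) {
            chosen = i;  /* oldest busy slot seen so far */
        }
    }
    pool[chosen].in_use = 1;
    pool[chosen].started_at = now;
    return chosen;
}
```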
And that’s all there is to this project. Hopefully it is of use to anyone looking to get started with FMOD for iOS. I’ve put the full source code for the project, including the FMOD Designer project source and audio files, on github. You can download it all from here.
I recently needed to stream playback of remote MP3 files over HTTP on a Windows-based project and, for various reasons, didn’t want to use any third party libraries. So I used the Direct Show API - a media library that is part of the Windows SDK.
The Direct Show API is only available in C++ and is based on COM. COM (the Component Object Model) might seem a bit odd or antiquated if you haven’t come across it before, but it isn’t too complicated to get to grips with. In essence it is a method of abstracting object interfaces, whereby the programmer queries opaque objects for their capabilities. The returned interfaces are similar to Objective-C protocols. If you are already familiar with Object Oriented Programming, a little background reading (here or here for example) should be enough to come to terms with COM.
Direct Show is a graph based system - similar to Audio Graphs in Core Audio on Mac OS X and iOS. Direct Show provides a Graph Builder interface (IGraphBuilder), which can be used to manually construct playback graphs from their component nodes, called Filters. However, the interface provides a useful RenderFile method which can automatically construct a playback graph for a given audio or video file. Though the documentation is vague on this point, the URL of a remote file can be passed as a parameter to this method. The resulting graph will stream the audio in the file, rather than download it.
So I thought I would share some code I wrote that wraps up this Direct Show functionality into a C++ class with a very simple interface:
Each instance of the Music Player class creates a new thread for creation and control of playback graphs. The thread sits in an event loop for the lifetime of the instance. This prevents any buffering or fading from blocking the main thread, and the control thread’s event loop provides a mechanism for responding to events signaled by the playback graphs themselves - such as reaching the end of a file or encountering a network error.
Each new MP3 playback request is run on its own playback graph. As graphs are therefore being created and destroyed frequently, I wrapped this up in a couple of convenience functions, and created a struct type to hold a graph and all its COM interfaces that I make use of:
The Music Player class has an mCurrentPlaybackGraph member of the struct type to keep track of the currently playing graph and an mp3Request member to hold the details of a received playback request. When a Music Player instance receives a new playback request, it simply raises an event signal, which is processed in the control thread:
That’s really all there is to it. The full code is listed below.
A few months ago I posted a video outlining my final project for my MSc in Sound and Music for Interactive Games. This week I’m making my final submission, so today I’m posting the project as it stands (I hesitate to use the word completed!)
A few details have changed over the intervening time, but the project remains much as it was described in the video. In essence I have created a piece of software that can be embedded in a game to allow a player to replace the game’s music with music from their own library. The system automates the choice of music by matching the emotional content of music with the real-time mood of gameplay. Thus, some of the information or feedback from the original game music can be preserved. I’ve been calling this system Harmonious.
I became interested in this area when I realized, much to my surprise, how prevalent the muting and replacement of game music was. In a previous post I gave results of a survey I ran that showed 25% of gamers never turn off game music, 64% sometimes mute and replace music, and 11% regularly do so. Additionally, the major game platforms allow for game music to be replaced with players’ music at a system level, and indeed Microsoft requires that game developers implement this on titles released for the Xbox 360. However, game developers expend great effort in designing adaptive and dynamic soundtracks that provide information on the game to the player. When this music is muted, this information is lost. Therefore my aim was to develop a system that created a middle ground: where players could listen to their own music, but some of the original music’s feedback was retained.
To demonstrate my Harmonious system, I created a game that makes use of it. This is a simple, arcade-style shooter called Space Defender. You can download this from the link below. The instructions document in the download will let you know how to play with and without the Harmonious system. The game runs on Mac OS X v10.6 and higher.

Download the Space Defender Game with the Harmonious system
You can read my full paper on how the Harmonious system was developed, how it works, and some evaluation of its effectiveness from the link below.

The full Harmonious paper (PDF)
Or, if you just want the short version, read on…
Overview
The Harmonious system is structured as follows:

A music analyser iterates through the player’s music library and determines the emotional content of each piece of music. The results of this analysis are stored in a database. Access to this data from the rest of the system is provided via a music manager object.
The system controller object presents both the programmer’s API to the game developer and the user interface to the player. The system controller receives cues for music to start, stop and change, and a representation of the current mood of gameplay from a game analyser object. When a new musical selection is required the system controller passes this mood value to the music manager, which determines the best matching music from the player’s library. The system controller then plays back this piece of music.
The only component of the system that is game-specific is the game analyser object. This object examines the game’s state and determines the current mood of the game, and therefore requires a game-specific implementation. However, a standard interface between a game analyser object and the system controller has been defined.
The controls exposed to the player include standard play/pause and skip controls, as well as controls that can fine tune the mood-matching process and ‘teach’ the database about the player’s preferences (more on this below).
Music Analysis and Emotion Matching
The music analyser makes use of the Echo Nest platform. This is an online service that combines a server-based music analyser that mixes computational and textual approaches, and a database of “tens of millions” of pre-analysed pieces of music. This gives a large set of existing analysis data that can quickly be retrieved and should hopefully cover a high percentage of the music encountered by the Harmonious system. For music that has not already been analysed by Echo Nest, the system can fall back on using the server-based analyser, which can typically analyse a 3 minute piece of music in under 5 seconds. In order to validate the accuracy of the Echo Nest’s analyser, I prepared a test measuring its analysis results against the CAL500 reference dataset. Full details of the test and the results are given in this previous post, but the headline result was that the analyser scored 83% accuracy, which compares very favourably with other Music Emotion Recognition and Music Information Retrieval systems.
Correct analysis of music requires accurate identification of audio files, and to this end the Harmonious system makes use of the EchoPrint audio fingerprinting system, which I discussed in this post. The overall analysis workflow is detailed in this post.
The result of the music analysis is the emotional content of each piece of music represented as a shape on a two dimensional plane, where the axes represent the arousal (level of energy) and valence (positive or negative feeling) levels of an emotion. This particular model is popular in Music Emotion Recognition studies. The diagram below shows where common emotions would be located on such a plane.

The Harmonious system represents the emotional content of music as circular shapes on this plane. When it matches music against gameplay, the current mood of gameplay is represented as a point on this plane. Any music with a shape that encloses this point is a potential match.

Of all the music represented by the circular shapes, the three pieces in colour would be a potential match to the gameplay mood indicated by the blue dot.
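The enclosure test itself is simple geometry: a point lies inside a circle when its distance from the centre is no greater than the radius. A minimal C sketch, using hypothetical MoodPoint and MoodCircle types and assuming that a point exactly on the edge counts as enclosed:

```c
/* A location on the arousal-valence plane. */
typedef struct {
    double arousal;
    double valence;
} MoodPoint;

/* A circular shape describing the emotional content of some music. */
typedef struct {
    MoodPoint centre;
    double radius;
} MoodCircle;

/* Returns 1 if the circle encloses the point, 0 otherwise.
   Squared distances are compared so no square root is needed. */
int circle_encloses(const MoodCircle *circle, MoodPoint point)
{
    double da = point.arousal - circle->centre.arousal;
    double dv = point.valence - circle->centre.valence;
    return da * da + dv * dv <= circle->radius * circle->radius;
}
```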
Users can fine tune the matching process by indicating approval or disapproval of matches made by the system. A ‘like’ and a ‘dislike’ control are available on the user interface, and are mapped to the ‘1’ and ‘2’ keys during gameplay in the Space Defender game. When a player approves of a match, the system adds an additional circular shape to the music’s record, centered around the current gameplay mood. When the system is looking for a match, a piece of music with multiple shapes that enclose the gameplay mood point will have a greater chance of being chosen over a piece with fewer shapes enclosing the point.

The piece of music represented by these shapes would have a match score of 2 as the music has two shapes enclosing the gameplay mood point. This music would be twice as likely to be chosen as music with only a single shape enclosing the point.
Similarly, when the user indicates disapproval of a match, the system adds an additional shape to the music’s record, but this shape has a ‘negative’ effect, in that if the shape encloses the current gameplay mood location when the system is searching for a match, the likelihood of selecting this music is reduced.

The piece of music represented by these shapes would have a match score of -1 as it has 1 positive and 2 negative shapes enclosing the gameplay mood point. Pieces of music with match scores below 1 are excluded from match selection.
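The scoring described above can be sketched in C as follows. This is an illustration under stated assumptions rather than the Harmonious source: the MoodShape struct and both function names are hypothetical, positive and negative shapes are modelled as weights of +1 and -1, and selection is assumed to be proportional to the score, which is consistent with a score of 2 being twice as likely as a score of 1.

```c
#include <stddef.h>

/* A circular shape on the arousal-valence plane. */
typedef struct {
    double arousal, valence;  /* centre of the circle */
    double radius;
    int weight;               /* +1 for analysis or 'like' shapes, -1 for 'dislike' */
} MoodShape;

/* Sum the weights of every shape that encloses the gameplay mood point. */
int match_score(const MoodShape *shapes, size_t count,
                double arousal, double valence)
{
    int score = 0;
    for (size_t i = 0; i < count; i++) {
        double da = arousal - shapes[i].arousal;
        double dv = valence - shapes[i].valence;
        if (da * da + dv * dv <= shapes[i].radius * shapes[i].radius) {
            score += shapes[i].weight;
        }
    }
    return score;
}

/* Choose a track with probability proportional to its score. Tracks
   scoring below 1 are skipped. `roll` is expected to be a random value
   in [0, sum of eligible scores). Returns the track index, or -1 if
   nothing is eligible or the roll is out of range. */
long pick_track(const int *scores, size_t count, int roll)
{
    for (size_t i = 0; i < count; i++) {
        if (scores[i] < 1) {
            continue;  /* excluded from match selection */
        }
        if (roll < scores[i]) {
            return (long)i;
        }
        roll -= scores[i];
    }
    return -1;
}
```

With scores of 2, -1 and 1, the rolls 0 and 1 select the first track and the roll 2 selects the third, so the first track is chosen twice as often.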
Implementing Harmonious in a Game
Implementing the system in a game requires the programmer to create a game analyser object, which provides the Harmonious system with cues to start, transition and stop music and a representation of the current gameplay mood. This object must implement the following 4 methods, which will be called by the rest of the system:
-(void)activate;
-(void)deactivate;
-(BOOL)isActive;
-(NSPoint)currentGameplayMoodInArousalValenceCoordinates;
The first two methods simply inform the object that it should start and stop sending cues; the latter two allow the system to determine if the game analyser is active and get a representation of the game mood.
The game analyser object is provided with a reference to the system controller, and can call the following two methods on the controller to send cues:
-(void)playNewTrackForArousalValencePoint:(NSPoint)avPt;
-(void)fadeOutAndStopCurrentTrack;
In order to carry out the actual analysis of the gameplay, the analyser object has a reference to an arbitrary game controller object, which the programmer can set to any object in the game code that can provide the analyser with the data it requires.
Evaluation
Some small scale testing and evaluation of the Harmonious system has been carried out by players playing the Space Defender game with both its original adaptive music and music selected by Harmonious. Players found that the Harmonious system gave a similar level of feedback on gameplay as the original music, and that the music chosen by Harmonious increased their focus on intense sections of gameplay. However this testing was with a small number of players, and the Space Defender game is limited in terms of the emotions provoked in players. Definitive evaluations will require increased numbers of players and more emotionally complex games.
Back in April I wrote a post on getting up and running with the FMOD API in an Xcode project. That post covered Xcode 3 projects. Now Mac OS X 10.7 Lion has been released, and Xcode 4 has become the new standard, so I thought I would write an updated guide, as things are a little different with Xcode 4.
Install Names
FMOD Ex is a set of dynamic libraries. When an application uses a dynamic library, the runtime linker finds and loads that library when the application is loaded. On OS X, iOS and some other OSes, the means by which the dynamic linker can find the library is the library’s install name. The install name is just a file system path embedded in the library. For example, libcrypto.dylib, the cryptography library available on the Mac, has an install name of /usr/lib/libcrypto.0.9.8.dylib.
The FMOD Ex dynamic libraries, libfmodex.dylib, libfmodevent.dylib and libfmodeventnet.dylib by default have install names of ./libfmodex.dylib, ./libfmodevent.dylib and ./libfmodeventnet.dylib respectively. This means the dynamic linker will expect these libraries to be in the same directory as the application executable. Typically on the Mac, the actual application executable is contained within the .app package at Contents/MacOS/AppName, whereas libraries and frameworks bundled with the app are usually located in Contents/Frameworks/. I usually stick to this convention and place the FMOD libraries in this location, therefore the first step is to update the FMOD libraries’ install names to point to this location.
Apple provides a few wildcard tokens that can be included in install names to make them generic and reusable. These are @executable_path, which expands to the directory containing the application executable that is loading the dynamic library; @loader_path, which expands to the directory containing the binary that is loading the dynamic library (this may be the application, or some other library or framework); and @rpath, which searches a list of locations for the dynamic library. @rpath is the most flexible of these tokens, as when set in the install name of the dynamic library, that library can be used in any applications, frameworks or any other type of project without needing further modification.
install_name_tool is the command line program used to alter install names. To change the install name for the FMOD Ex library, open a terminal window, cd to the directory containing the libfmodex.dylib file, and run the command:
install_name_tool -id @rpath/libfmodex.dylib libfmodex.dylib
libfmodex.dylib contains the low-level FMOD Ex API. If you also wish to use the higher level Event System API (for playback of projects created with FMOD Designer) cd to the directory containing the libfmodevent.dylib file, and change its install name in the same way:
install_name_tool -id @rpath/libfmodevent.dylib libfmodevent.dylib
Likewise, if you want to use the Event Network system to hook the Designer app up to your game for auditioning, change the libfmodeventnet.dylib file’s install name with:
install_name_tool -id @rpath/libfmodeventnet.dylib libfmodeventnet.dylib
Both the Event System library and the Event Network library have a dependency on the FMOD Ex library, so we need to update their links too. (You can see the dependencies of a library by running the ‘otool -L path/to/library’ command in the terminal). This is done by cd’ing to the directory containing the libfmodevent.dylib and libfmodeventnet.dylib files and running the commands:
install_name_tool -change ./libfmodex.dylib @rpath/libfmodex.dylib libfmodevent.dylib
and
install_name_tool -change ./libfmodex.dylib @rpath/libfmodex.dylib libfmodeventnet.dylib
Adding the libraries to an Xcode project
In your Xcode project, select the project file at the top of the source list, select the target application/game, click on the Build Phases tab and expand the ‘Link Binary With Libraries’ box, as shown in the image below.

Click on the + button to add new libraries to link to. This will bring up a sheet with all the default system frameworks and libraries. Click the ‘Add Other’ button, then navigate to the FMOD dylib files. Add libfmodex.dylib. If you wish to use the Event System and/or Event Network, also add libfmodevent.dylib and libfmodeventnet.dylib respectively. This will add the libraries to the list of files in the project navigator source list. I usually tidy these into a group called ‘FMOD’ and add it to the Frameworks group.
To have these libraries copied to your application/game’s bundle when building, add a ‘copy files’ build phase by clicking the ‘Add Build Phase’ button and selecting ‘Add Copy Files’. Expand the new ‘Copy Files’ box and change the Destination to Frameworks. Click the + button and from the sheet that appears, select the FMOD library/libraries that you are using.

To ensure that the @rpath token expands properly when the application/game is running, click on the ‘Build Settings’ tab and find the setting Runpath Search Paths. Add the value ‘@loader_path/../Frameworks’ to this list.

We also need to change the application/game’s library dependency paths with install_name_tool. However, doing this once from the terminal is no use, as this executable is being destroyed and rebuilt each time you recompile. Therefore we will run a script to do this each time we (re)compile. Click the ‘Add Build Phase’ button again, and this time select ‘Add Run Script’ from the menu. Expand the new ‘Run Script’ box and enter the text:
install_name_tool -change ./libfmodex.dylib @rpath/libfmodex.dylib "$TARGET_BUILD_DIR/$PRODUCT_NAME.app/Contents/MacOS/$PRODUCT_NAME"
If you are using the Event or Event Network libraries, add the following as required (remember to separate each with a line break)
install_name_tool -change ./libfmodevent.dylib @rpath/libfmodevent.dylib "$TARGET_BUILD_DIR/$PRODUCT_NAME.app/Contents/MacOS/$PRODUCT_NAME"
install_name_tool -change ./libfmodeventnet.dylib @rpath/libfmodeventnet.dylib "$TARGET_BUILD_DIR/$PRODUCT_NAME.app/Contents/MacOS/$PRODUCT_NAME"
Next add the following header files to the project by dragging them into the project navigator or selecting File > Add Files To “Project Name”…
- fmod.h
- fmod_codec.h
- fmod_dsp.h
- fmod_errors.h
- fmod_memoryinfo.h
- fmod_output.h
- fmod.hpp (only required if you are using the C++ API)
These are all that are required if you are only using the low-level Ex API. If you want to use the higher level Event System API you also need to add the following headers:
- fmod_event.h
- fmod_event.hpp (if you are using the C++ API)
If you want to use the Event Network system add these headers too:
- fmod_event_net.h
- fmod_event_net.hpp (if you are using the C++ API)
And that’s it! Your game/application can now use the FMOD API.
Today I finished off the music analysis section of my prototype music-gameplay matching system (see here for an overview of the project). The goal of the analysis sub-system is to examine music in a user/player’s digital library and produce representations of the music’s emotional content. As described in the video outlining my project, these representations are a shape on a 2-dimensional plane, where the axes of the plane represent Arousal (Energy) and Valence (Positive/Negative feeling). The final system will track the emotional content of gameplay as a point on this plane, and matching is based around finding music with shapes that enclose this point.
There are several approaches that I could have taken in the design of my analysis workflow. There is a great deal of work being done in Music Emotion Recognition (MER), a more tightly focused ‘sub-discipline’ of Music Information Retrieval (MIR) (for a comprehensive summary of recent efforts see Kim et al 2010). Many examples can be seen in the proceedings of the International Society for Music Information Retrieval conferences. Many researchers are taking a computational approach to MER, making use of custom processes built with lower level MIR tools such as Marsyas, LibXtract and the MIR Toolbox, or completely bespoke systems (for examples see Panda & Pavia 2011, Yang et al 2008, Lui 2003). However such systems can be extremely computationally expensive and as such the time taken to analyse several thousand tracks on a desktop-class computer would be prohibitive for a project such as mine.
A second approach is mining a pre-existing database of human created descriptions of music. Such databases come in two flavours: expert and crowd. Projects such as the Music Genome Project and the All Music Guide have databases of tags applied to pieces of music that have been carefully determined by music experts, and could present a valuable source of data. However these expert databases generally remain private due to the cost and effort of producing them. Crowd based music databases, such as last.fm are usually more open and accessible. Laurier et al (2009) showed that for basic emotional tags, such as happy, sad, angry and tender, crowd- or social-based music databases can be reliable sources. However, two issues remain with both expert- and crowd-based databases: No database can contain data on every piece of music that my system (or any other) is likely to encounter; and using language based tags as a datasource for a geometric based system requires a subjective mapping between language and numeric values. Such a mapping can only be as accurate as the values shared between the mapper and the tagger. For my system I would prefer to determine arousal and valence values by direct audio analysis.
I decided therefore that EchoNest would provide a good platform for my analysis system. EchoNest is a music information system that provides a backend for online music services and is used by the organisations such as the BBC and MTV. The EchoNest system is available free of charge for non-commercial uses. The platform combines a server-based music analyser that mixes computational and textual approaches, and a database of “tens of millions” of pre-analysed pieces of music. For my system this gives a large dataset of existing analysis data that can quickly be retrieved and should hopefully cover a high percentage of the music encountered by my system. For music that has not already been analysed by EchoNest, my system can fall back on using the server-based analyser. This analyser has the advantage of running on dedicated powerful hardware, and can typically analyse a 3 minute piece of music in under 5 seconds. The analysis time is therefore only limited by the time taken to upload the music to the server.
So I have developed my system around the EchoNest platform. As I mentioned in a previous post, I am also using EchoPrint, and open-source audio fingerprinting tool that creates fingerprint codes that can be used as search queries on the EchoNest database. Audio fingerprinting is the process of creating a unique code that represents an audio waveform that is often used in music identification apps on mobile phones.
The structure of the audio analysis workflow in my system is as follows. First, the system iterates through the user’s music library and identifies tracks that have not yet been analysed. Each ‘new’ track is then fingerprinted and looked up on the EchoNest database. If successful, the fingerprint lookup returns a single result with analysis data for the specific version of the music on that track. If the lookup fails, the system falls back on searching the database by artist name and track title. If successful, this returns a list of matching tracks, which may include versions of the music different from the one in the user’s library. Exact matches on the tracks’ metadata and duration are attempted first in order to find the right version. Failing an exact match, fuzzy string matching and looser duration matching are used to score the results and find one that fits within acceptable tolerances. If still no match has been found, the system falls back on uploading the track for a bespoke analysis.
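To make the fallback chain a little more concrete, here is a minimal Python sketch of the matching logic. It is an illustration only, not the code from my system: the function names, the dictionary layout, the scoring formula and the tolerance values are all assumptions, and the three lookup steps are passed in as placeholder callables rather than real EchoNest or EchoPrint calls.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def pick_best_match(track, candidates, max_duration_diff=5.0, min_score=0.8):
    """Return the candidate that best matches `track`, or None.

    `track` and each candidate are dicts with 'artist', 'title' and
    'duration' (seconds). All thresholds are illustrative assumptions.
    """
    # Pass 1: exact metadata match with a near-identical duration.
    for cand in candidates:
        if (cand["artist"].lower() == track["artist"].lower()
                and cand["title"].lower() == track["title"].lower()
                and abs(cand["duration"] - track["duration"]) < 1.0):
            return cand

    # Pass 2: fuzzy string matching with a looser duration tolerance.
    best, best_score = None, min_score
    for cand in candidates:
        diff = abs(cand["duration"] - track["duration"])
        if diff > max_duration_diff:
            continue
        text_score = (similarity(cand["artist"], track["artist"])
                      + similarity(cand["title"], track["title"])) / 2.0
        # Penalise duration differences linearly, by up to 20%.
        score = text_score * (1.0 - 0.2 * diff / max_duration_diff)
        if score >= best_score:
            best, best_score = cand, score
    return best


def resolve_analysis(track, fingerprint_lookup, metadata_search, upload):
    """Try each source in turn: fingerprint, metadata search, then upload.

    The three callables are hypothetical stand-ins for the network steps.
    """
    result = fingerprint_lookup(track)
    if result is not None:
        return result
    match = pick_best_match(track, metadata_search(track))
    if match is not None:
        return match
    return upload(track)
```

The ordering is the important design point: the cheapest and most specific lookup is tried first, and the upload, which is the slowest step, is only reached when nothing else fits within tolerance.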
Once the analysis data for the track has been retrieved from EchoNest, the system reads through the timbre and loudness data to determine whether the track has sections that differ significantly enough to have different emotional content. The sections are fine-tuned so that their boundaries lie on bar boundaries. A circular shape representing each section in the Arousal-Valence plane is then created. The centre of the circle is at a point on the Arousal-Valence plane determined by examining loudness, energy, key and mode data. The confidence rating of the data is used to determine the radius of the circle.
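The idea of turning a section into a circle can be sketched in a few lines of Python. Again this is only an illustration under stated assumptions, not my actual mapping: the formula for the centre, the -60 to 0 dB loudness range, the fixed valence offsets, the convention that mode 1 means major and 0 means minor, and the radius limits are all made up for the example. Key is left out entirely, since how it contributes is not described here.

```python
from dataclasses import dataclass


@dataclass
class Circle:
    valence: float  # x position, -1.0 (negative) to 1.0 (positive)
    arousal: float  # y position, -1.0 (calm) to 1.0 (excited)
    radius: float   # uncertainty: larger means less confident


def clamp(x, lo, hi):
    """Restrict x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def section_circle(loudness_db, energy, mode, confidence,
                   min_radius=0.05, max_radius=0.5):
    """Build an Arousal-Valence circle for one section (assumed mapping)."""
    # Normalise loudness from an assumed -60..0 dB range to 0..1.
    loud_norm = clamp((loudness_db + 60.0) / 60.0, 0.0, 1.0)
    # Average loudness and energy (both 0..1), then rescale to -1..1.
    arousal = clamp(loud_norm + energy - 1.0, -1.0, 1.0)
    # Assumption: major mode (1) leans positive, minor (0) leans negative.
    valence = 0.5 if mode == 1 else -0.5
    # Higher confidence gives a smaller circle.
    conf = clamp(confidence, 0.0, 1.0)
    radius = max_radius - conf * (max_radius - min_radius)
    return Circle(valence, arousal, radius)
```

With this mapping a loud, energetic, major-mode section with high confidence becomes a small circle in the upper right of the plane, while a quiet, low-confidence minor section becomes a large circle in the lower left.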
The flowchart below shows the whole analysis workflow a little more clearly:

References
Kim, Y.E., Schmidt, E.M., Migneco, R., Morton, B.G., Richardson, P., Scott, J., Speck, J.A. and Turnbull, D., 2010. Music emotion recognition: A state of the art review. In: Proceedings of the 11th International Society for Music Information Retrieval Conference, 9-13 August, 2010, Utrecht (Netherlands). ISMIR, p. 255-266.
Laurier, C., Sordo, M., Serrà, J. and Herrera, P., 2009. Music mood representations from social tags. In: Proceedings of the 10th International Society for Music Information Retrieval Conference, 26-30 October 2009, Kobe (Japan). ISMIR, p. 381-386.
Liu, D., Lu, L. and Zhang, H.J., 2003. Automatic mood detection from acoustic music data. In: Proceedings of the International Symposium on Music Information Retrieval, 26-30 October 2003, Baltimore, Maryland (USA). ISMIR.
Panda, R. and Paiva, R.P., 2011. Using support vector machines for automatic mood tracking in audio music. In: 130th Convention of the Audio Engineering Society, 13-16 May 2011, London (UK). AES.
Yang, Y.H., Lin, Y.C., Su, Y.F. and Chen, H.H., 2008. A regression approach to music emotion recognition. IEEE Transactions on Audio, Speech, and Language Processing, 16 (2), p. 448-457.