Wednesday, June 19, 2013

A platform for people to jam music in small public spaces

When you are in a public space such as a canteen or a plaza, you'll often hear music coming out of centralised speakers. Most of this music is soft in style: slow, smooth, and without vocals. Sometimes you feel the background atmosphere could be made better. Sometimes you'd just like to show that you exist, send a message, or do something for fun. And at other times you feel bored and would simply like to find something to do. Then maybe you can try to jam along with the background music to make something interesting happen.

How about we invent a social network jamming platform (actually, my team has already implemented this) to fulfil this need? Let's call it "WIJAM", meaning "we instantly jam together over WiFi". The basic idea is simple: someone takes out a cell phone, opens an app, and starts to create a melody; the melody is instantly mixed with the original background music and with melodies from other people, and broadcast via a central speaker system.

But many problems underlie this scenario, and they lead to deeper research on the topic. The first problem that naturally pops up is: do the users really know how to jam? This is a huge problem. Assuming some of them are musical novices, they may have very good musical ideas in mind but not know how to express them on a traditional instrument layout such as a keyboard. OK, maybe we provide them with a fixed musical scale, say a pentatonic scale or the Ionian scale, so that whatever they play stays within the scale. Then what about the key? Some simple background music uses only one fixed key without modulation. Even with such background music, the novices still need to choose a key to apply the scale to. How would they do that? Not to mention grooves that change keys or even change the available scales. No, we cannot ask users to determine so many things; we should keep them as comfortable and entertained as possible. This leads to the idea of a master controlling system that instantly assigns keys and scales to all the users. Since the final outcome of the jam is played back via the central speakers, this master system is also in charge of collecting the performances from the players and distributing them to the speakers.
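To make the master's job a bit more concrete, here is a minimal sketch of the "assign a key and a scale" part. It is not the actual WIJAM code: the class names `Assignment` and `Master`, the bar-based key plan, and the scale table are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Scale patterns as semitone offsets above the tonic (standard definitions).
SCALES = {
    "ionian": (0, 2, 4, 5, 7, 9, 11),
    "major_pentatonic": (0, 2, 4, 7, 9),
}


@dataclass(frozen=True)
class Assignment:
    """Hypothetical message the master sends to every user."""
    key: int    # tonic as a pitch class, 0 = C ... 11 = B
    scale: str  # name of an entry in SCALES


class Master:
    """Hypothetical master that knows the key plan of the background music."""

    def __init__(self, plan):
        # plan: list of (start_bar, Assignment) pairs
        self.plan = sorted(plan, key=lambda item: item[0])

    def assignment_at(self, bar):
        """Return the assignment that is in force at the given bar."""
        current = self.plan[0][1]
        for start_bar, assignment in self.plan:
            if start_bar <= bar:
                current = assignment
        return current

    def allowed_pitch_classes(self, bar):
        """Pitch classes the users may play at the given bar."""
        a = self.assignment_at(bar)
        return sorted((a.key + step) % 12 for step in SCALES[a.scale])
```

For example, with a plan that starts in C Ionian and modulates to G major pentatonic at bar 8, `allowed_pitch_classes(8)` gives the pitch classes of G, A, B, D and E. The users never have to think about the key change themselves.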

Yet this is far from the end of the story. As you know, the touch screen of most mobile phones is relatively small. If you have ever played a keyboard on a mobile phone, you probably found it a bad experience because the keys on the screen are too small to hit. Back to our scenario: what should the master actually assign to the users? A real piano keyboard has 88 keys, spanning a little more than 7 octaves. The number of notes of a C Ionian scale (which contains 7 distinct pitch classes) on it is therefore approximately 7*7 = 49. But obviously we cannot fit all 49 notes on a single cell phone screen. Note that most of the time piano players focus on only 2-3 octaves in the middle of the keyboard; they seldom go to the very low or very high registers. Also note that our scenario already has background music, which probably contains the bass part. So we can safely omit the 3 lowest octaves and the 2 highest octaves, keeping about 2-3 octaves, which still gives 14-21 notes. This number of notes is good enough for normal expression. But then how should we place these notes on the touch screen?
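The note counting above can be checked with a few lines of code. This is only a sketch: the function name `scale_notes` is made up, and the MIDI conventions (A0 = 21, C8 = 108, middle C = 60) are my assumption for how notes would be numbered.

```python
IONIAN = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets of the Ionian (major) scale


def scale_notes(tonic_pc, intervals, low_midi, high_midi):
    """List the MIDI notes in [low_midi, high_midi] that belong to the scale."""
    allowed = {(tonic_pc + step) % 12 for step in intervals}
    return [n for n in range(low_midi, high_midi + 1) if n % 12 in allowed]


# The 88 piano keys run from A0 (MIDI 21) to C8 (MIDI 108).
full_keyboard = scale_notes(0, IONIAN, 21, 108)

# Two and three octaves starting from C3 (MIDI 48), assuming middle C = 60.
two_octaves = scale_notes(0, IONIAN, 48, 71)
three_octaves = scale_notes(0, IONIAN, 48, 83)
```

On the full keyboard this yields 52 notes (the white keys), which is close to the rough estimate of 49 because the keyboard is slightly longer than 7 octaves. The two- and three-octave ranges yield 14 and 21 notes.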

Actually this is one of the crucial issues that determine the success of this application. I must admit that so far no final decision has been made on it. One simple way, which we adopted at the beginning, is to divide the screen evenly into 4 rows and 2 columns. With this layout the user has 8 notes to play with at a time. The master has several 8-note patterns ready to assign to the users, and each of these patterns lies within a certain scale such as Ionian or Dorian. The advantage of this layout is that every note has a relatively large touchable area, and the novice user's freedom of expression is kept within a scale-conforming set of 8 notes. But the disadvantages seem to outweigh the advantages. Those who want more notes can't get them until the master assigns a new pattern. Normal users cannot tell the notes apart until they play and listen, and they will soon get bored when they realize that controlling their own expression is not intuitive. There are two facets to these drawbacks: 1. expert users want more freedom, so they need a detailed control panel; 2. normal users want more control over their expression, so they need an intuitive control panel.
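The 4-row, 2-column layout boils down to a small lookup from touch position to note. The sketch below is not our app code. The function name, the pixel coordinates and the ordering of notes (lowest notes on the bottom row, left to right) are assumptions for illustration.

```python
ROWS, COLS = 4, 2


def touch_to_note(x, y, width, height, pattern):
    """Map a touch at pixel (x, y) to one of the 8 notes in `pattern`.

    `pattern` is the 8-note list assigned by the master, lowest note first.
    Assumed layout: y grows downward, the lowest notes sit on the bottom row.
    """
    if len(pattern) != ROWS * COLS:
        raise ValueError("pattern must contain exactly 8 notes")
    # Clamp so that a touch on the far edge still lands in the last cell.
    col = min(int(x * COLS / width), COLS - 1)
    row = min(int(y * ROWS / height), ROWS - 1)
    row_from_bottom = ROWS - 1 - row
    return pattern[row_from_bottom * COLS + col]
```

With a 320x480 screen and a C Ionian pattern from C4 to C5, a touch in the bottom-left corner plays C4 and a touch in the top-right corner plays C5. When the master assigns a new pattern, only the `pattern` list changes; the layout stays the same.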

To cater for both kinds of users, the system design rule "leave it to the user" needs to be applied: two layouts should be designed, and users should choose the one they prefer. For expert users, a note-based layout should be designed. It contains 16-32 notes and should be laid out in a systematic way that is easy to start with yet difficult to master. For novice users, a graphic-based, or drawing-based, layout is preferable. It maps the drawing to the ups and downs of the melody line, which correspond to the feeling the user wants to express. An algorithm then determines which pitch to use according to the chord-scale combination provided by the master. If you are sharp enough, you may notice an issue with the drawing-based interface: how can the user express "rhythm"?
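One possible way to turn a drawing into pitches is to map the vertical position of the finger to a target pitch and then snap it to the nearest note the master allows. This is just a sketch of that idea, not a decided design. The function names and the "top of the screen is the highest note" convention are assumptions.

```python
def snap_to_scale(y, height, allowed_notes):
    """Map a vertical position to the nearest allowed MIDI note.

    Assumed convention: y = 0 is the top of the screen (highest note),
    y = height is the bottom (lowest note).
    """
    notes = sorted(allowed_notes)
    low, high = notes[0], notes[-1]
    fraction = 1.0 - min(max(y / height, 0.0), 1.0)  # 0 at bottom, 1 at top
    target = low + fraction * (high - low)
    return min(notes, key=lambda n: abs(n - target))


def stroke_to_melody(ys, height, allowed_notes):
    """Convert sampled y positions of one stroke into a melody line.

    Consecutive samples that snap to the same note are merged.
    """
    melody = []
    for y in ys:
        note = snap_to_scale(y, height, allowed_notes)
        if not melody or melody[-1] != note:
            melody.append(note)
    return melody
```

Note that this only produces the contour of the melody. It says nothing about when each note should sound, which is exactly the rhythm question raised above.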

One approach is to use finger motion to signal a note-on for the new note and a note-off for the old note, where the new note is mapped from the new position of the finger. For instruments such as guitar and piano we don't even need to signal the note-off most of the time; it can be sent automatically after the note finishes decaying. This seems a workable approach, though we haven't implemented it yet. Another approach is to use one finger to signal the rhythm and another finger to draw the melody. This is also interesting, and quite convenient indeed.
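Since the first approach is not implemented yet, here is only a rough sketch of how it could work. The function name, the event tuples and the `sustained` flag are all hypothetical. `sustained=False` stands for decaying instruments like piano or guitar, where no explicit note-off is sent.

```python
def motion_to_events(notes, sustained=True):
    """Turn the notes sampled under a moving finger into note events.

    `notes` is a sequence of MIDI notes, with None meaning the finger is lifted.
    A note-on is emitted whenever the finger arrives on a different note.
    For sustained instruments the old note also gets an explicit note-off.
    """
    events = []
    current = None
    for note in notes:
        if note == current:
            continue  # finger is still on the same note, nothing to signal
        if current is not None and sustained:
            events.append(("note_off", current))
        if note is not None:
            events.append(("note_on", note))
        current = note
    if current is not None and sustained:
        events.append(("note_off", current))  # close the last sounding note
    return events
```

One limitation shows up immediately: the same note cannot be repeated without lifting the finger, which is one reason the two-finger approach (one finger for rhythm, one for melody) looks attractive.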

So much for the interface issue. What about the overall performance outcome of this system? How do we make the performances of an ad-hoc group of scattered experts and novices make sense, fit together, or at least sound good? This problem is twofold. The easier and more fundamental part is how to make the mixture sound good, while the more difficult part is how to make it make sense.

So how? Tackling the "sound good" problem is relatively easy. It demands something called algorithmic mixing and mastering. In theory (I do not have a source for this statement), the sum of any number of channels of any sound can be made comfortable to the human ear as long as it is well mixed and mastered, regardless of the underlying musical structure (such as the chord progression). In other words, we can always manage to make it sound comfortable. The problem is how. As this has not been implemented yet, I cannot tell for the moment, but let's make some guesses. A very simple algorithm would be to set the volume of every channel to 1/n, where n is the number of channels. This makes sense but is not an ideal solution, because you may argue that some channels should carry a higher weight. So here comes the problem: how do we determine which channel has a higher weight? One approach is again to leave it to the users, but since we assume most of the users are musical novices, it may not work as wished. Note that our scenario is public space jamming, which is very different from an on-stage live performance. An on-stage performance gives a sense of being in focus, while our scenario gives a sense of being scattered, in which everyone enjoys being hidden instead of being watched. So "weight" doesn't carry the same meaning as before. Interestingly, people participating in the jam will probably also wish their performance to be heard by everyone else. Combining these two observations, we can derive a bottom line for the auto mixing and mastering mechanism: every player should at least be able to hear their own contribution from time to time. With this key finding, we can write an algorithm involving some randomness to implement this function. Not a big deal yet.
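As a guess at what such an algorithm might look like (again, nothing here is implemented in WIJAM, and all function names and the `boost` factor are made up), the sketch below shows the plain 1/n mix, a gain set that lifts one "featured" channel, and a randomized schedule in which every player gets featured once per cycle.

```python
import random


def equal_mix(channels):
    """Mix n equal-length channels, each scaled by 1/n."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]


def featured_gains(n, featured, boost=2.0):
    """Gains that sum to 1, with one channel weighted `boost` times the others."""
    weights = [boost if i == featured else 1.0 for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def feature_schedule(n, rounds, rng):
    """Pick one featured channel per round.

    The order is random, but each block of n rounds is a shuffled cycle,
    so every player is featured exactly once per cycle.
    """
    schedule = []
    while len(schedule) < rounds:
        order = list(range(n))
        rng.shuffle(order)
        schedule.extend(order)
    return schedule[:rounds]
```

A call like `feature_schedule(4, 8, random.Random())` returns 8 channel indices in which each of the 4 players appears once in the first half and once in the second half. The randomness keeps it from feeling mechanical, while the cycle structure guarantees the "heard from time to time" bottom line.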

The big deal is how to make the outcome make sense. To make sense, a logical expression of music is needed. It is about comforting not just our ears but our minds. Since the basic assumption is that the participants are mostly musical novices, this becomes an algorithmic composition problem. It could be a classical algorithmic composition problem, when the musical content involves pitches and timbres of standard instruments, or a new kind of algorithmic composition problem, when more sound synthesis elements are involved. So again it can be divided into two sub-problems. To tackle the classical one, we should attempt a stricter algorithmic composition approach, which involves a lot of AI and is far from fully developed. Algorithmic composition techniques can be found in a large body of literature. A "modern" algorithmic composition problem, which aims to output a modern musical style such as electroacoustic music, is relatively easier as I see it, because the aesthetics of such styles are much more subjective than those of music within the range of classical music theory. As for implementing the algorithm, there are several ways. One way is to implement it on the user's side, so that every user jams within their own algorithmic logic. The disadvantage, as you may easily notice, is that their "logics" may collide with each other, so that the outcome of the jam is not so pleasant. True! So we have another approach, in which the algorithm is implemented on the master side, so that the master can coordinate all the users to create a piece of good music. Implemented like this, the master is actually an algorithmic composer and a conductor as well. A third approach would be for both the master and the users to implement algorithms, with a feedback channel from the master to each user; the feedback tells the user's algorithm to adjust itself to the whole performance.
I think the third approach is the optimal one. Saying more about this topic is beyond my ability for the moment, so I'd better stop here.

And there are still other issues. One is the audio engine. For now this project uses AUPreset + AudioUnit + AVAudioSession to make sound, and the AUPreset file points to the prerecorded instrument samples from Apple's GarageBand. I don't know whether this is legal or not (though according to the official statement it seems to be). Anyway, a more interesting and challenging approach would be to use the mobile STK, mobile Csound, or ChucK for sound synthesis. Hopefully with one of these engines the size of the app can be greatly reduced, and the app gets to include some of the most advanced stuff as well. Another issue is whether to use OSC instead of MIDI as the transport for the musical performance data, since OSC seems to have lower network transport latency. We'll see. A third issue, which is a big one, is evaluation, and I'm gonna use the next few paragraphs to discuss it.

This kind of work, if ever submitted to academic journals or conferences, will definitely confront the problem of evaluation. In other research areas, such as computer architecture, evaluation is quite straightforward. There are metrics indicating whether an architecture (or a certain architectural improvement) is good or not, the most widely used one being "performance", which indicates how fast a system runs. In the field of computer music, however, evaluation becomes much more ambiguous. How do we determine whether a computer music system is good or bad? To narrow the discussion down: in what sense can we say that a networked collaborative system is good? One approach found in many papers is to "let the audience judge". These papers present the audience's "feedback", which may be subjective impressions, advice, or questionnaire responses. Similarly, in one paper I saw, the evaluation method was to post the outcomes of the computer music system on the web and let viewers rate them. Besides that, there is the "let the participants judge" method. The logic behind these two evaluation methods is quite natural and obvious: since music is an aesthetic process, the final judgement should be made by human beings.

But there is still another approach, which could be called "machine appreciation". This method would be implemented with "machine listening" in a very high-level sense. Ordinary machine listening only cares about the audio material and the structure behind it, while machine appreciation demands a higher level of machine listening that cares about the musical material and the musical structure as well. Of course, everything can be scaled down to a simplest case. For machine appreciation in the context of classical music structure, the simplest algorithm only needs to check whether the notes are within the scale or whether the voice leading obeys the rules. But of course, as you will agree, this is far from enough. Whether a machine is able to appreciate music at all is itself a big, big question to be answered.
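The simplest case mentioned above, checking whether the notes are within the scale, could be sketched like this. The function name and the choice of a 0-to-1 ratio as the score are my assumptions, and voice leading rules are left out entirely.

```python
IONIAN = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets of the Ionian (major) scale


def in_scale_ratio(notes, tonic_pc, intervals):
    """Fraction of the played MIDI notes whose pitch class belongs to the scale.

    Returns 0.0 for an empty performance.
    """
    if not notes:
        return 0.0
    allowed = {(tonic_pc + step) % 12 for step in intervals}
    hits = sum(1 for n in notes if n % 12 in allowed)
    return hits / len(notes)
```

For the notes C4, D4, E4, F#4 judged against C Ionian, the ratio is 0.75 because F# is outside the scale. A score like this is trivial to compute, which is precisely why it says so little about whether the music is any good.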

Nevertheless, we can use machines to do some measurements, such as whether doing such and such makes the users more active, or whether such and such makes a collaborative system more responsive. These are things a machine can absolutely do.

So much for the evaluation part. I guess I have now given an introduction to this collaborative music jamming application for small public spaces. I've discussed the big scenario, the roles of the master and the users, the user interface issues, the output quality issue, and the evaluation issue. With all of these, I guess an extremely great public space jamming application can be created! Hopefully it can be done soon!

