# Joseph Mate's Blog

## Unit Testing Common Requirements

Posted in Java, Programming, Unit Testing by Joseph on August 28, 2016

Suppose you wanted to unit test java.util.Map. How would you verify each implementation? Take a quick look at the Map javadoc, and you’ll find 20 implementing classes. Each Map implementation has common unit tests for requirements that you want to avoid copying. In this post I illustrate two techniques you can use for testing common requirements over implementations by testing my simplified version of java.util.Map.

I also encountered a similar problem when taking the Coursera Scala Parallel Programming course. The course’s assignments required sequential and parallel implementations. I noticed I ended up copying and pasting my tests of the sequential implementation to the parallel one. I came up with these solutions to de-duplicate the code.

# Introduction to Example

In this example we implement a simplified Map interface with three methods, get, put, and getKeys:

public interface Map<K,V>  {
void put(K key, V value);
V get(K key);
List getKeys();
}


We provide two implementations of the interface. InefficientMap uses two ArrayLists to store the keys and values and SortedMap which guarantees that getKeys() returns the keys in sorted order. The implementations are purposely naive because the focus of this article is the testing aspect, not performance.

# Method 1: Abstract Class

Abstract classes can be used to test the common requirements of differing implementations. This method provides the least duplicated code. You have all the common test cases in the base class which makes use of a factory method for generating the implementation.

To apply the inheritance you follow these steps:

1. Create an abstract class for the common unit tests
2. Add an abstract method that expects the extender provide a new instance the real implementation
3. Add @Tests the verify the requirements of the interface
4. Create a unit test class for each implementation and extend from the abstract unit test class
5. Provide the real implementation expected by the abstract method
6. Add any additional @Tests that test requirements specific to the implementation

## Step 1 and 2: Creating the Abstract Class

You create an abstract class with an abstract factory method for returning the real implementation.

public abstract class InheritingMapTest {
protect abstract Map<K,V> makeMap();
}


## Step 3: Add @Tests

In the abstract unit test class, we include requirements that are common to all maps. The tests make use of the factory method for getting an instance of the implementation.

Here we look at only one of the many test cases for the Map interface. The factory provides us with a mechanism for obtaining a new instance of the implementation.

Getting the value using a key that was replaced is should return the most recent value.

@Test
public void testValueReplaced() {
Map<Integer,Integer> map = makeMap();
map.put(0, 10);
map.put(0, 12);
assertEquals(Integer.valueOf(12), map.get(0));
}


## Step 4 and 5: Create a Unit Test Class for each Implementation

Now for each implementation we extend from the abstract class and implement the factory method.

public class InheritingSortedMapTest extends InheritingMapTest {
protected Map<Integer,Integer> makeMap() {
return new SortedMap<>(IntComparator.intComparator);
}
}


## Step 6: Add Additional Tests of Uncommon Requirements to Subclass

The sorted map comes with the additional requirement that the getKeys() method returns the keys in sorted order. We add tests of this new requirement into the subclass.

@Test
public void testKeysSorted() {
SortedMap<Integer,Integer> map = new SortedMap<>(IntComparator.intComparator);
map.put(2,12);
map.put(1,11);
map.put(3,13);
List keys = map.getKeys();
for(int i = 0 ; i < 3; i++) {
assertEquals(Integer.valueOf(i+1),keys.get(i));
}
}


# Method 2: Composition

An alternative to this pattern is to include all the tests in a helper class and have each unit test call the testAll() method. The problem with this solution is that you end up having a method which calls all your test methods which duplicates some code. More importantly, it makes it more difficult to figure out which test failed because you have to inspect the stack trace instead of click on the failed test in an IDE like eclipse.

1. Create a factory interface
2. Create class containing that accepts the factory in the constructor
3. Add tests the verify the requirements of the interface
4. Call each test method from a public ‘testAll()’ method
5. Create a unit test class for each implementation that creates and instance of the test class
6. Provide an implementation of the factory to the test class
7. Add any additional @Tests that test requirements specific to the implementation

## Step 1 and 2: Create the Factory and Test Class

The ‘Testing Class’ has a Factory for creating new instances of the implementation when it needs it.

public class CompositionMapTest {
public interface MapFactory {
/**
* Return a new instance of your implementation of Map<>;
*/
Map<Integer,Integer> make() ;
}

private MapFactory mapFactory;

public CompositionMapTest(MapFactory mapFactory) {
this.mapFactory = mapFactory;
}
}


## Step 3 and 4: Add Tests

We make use of the factory to create an instance of the implementation and verify the object matches the specification.

private void testValueReplaced() {
Map<Integer,Integer> map = mapFactory.make();
map.put(0, 10);
map.put(0, 12);
assertEquals(Integer.valueOf(12), map.get(0));
}


The ‘Testing Class’ invokes all the test methods in a “testAll()” method which the real test classes will call.

public void testAll() {
...
testValueReplaced();
...
}


## Step 5 and 6: Create a class for each implementation

To reuse the existing tests of the ‘Test Class’ we create an instance of it, providing it with a Factory for creating implementations. We invoke the test class’s “testAll()” method to check our implementation.

public class CompositionSortedMapTest {
@Test
public void testBasicRequirements() {
CompositionMapTest basicTests = new CompositionMapTest(
new CompositionMapTest.MapFactory() {
public Map<Integer,Integer> make() {
return new SortedMap<>(IntComparator.intComparator);
}
});
basicTests.testAll();
}
}


## Step 7: Add any additional tests for uncommon requirements

We write tests as we would normally for requirements that are not common to all Maps.

@Test
public void testKeysSorted() {
SortedMap<Integer,Integer> map = new SortedMap<>(IntComparator.intComparator);
map.put(2,12);
map.put(1,11);
map.put(3,13);
List keys = map.getKeys();
for(int i = 0 ; i < 3; i++) {
assertEquals(Integer.valueOf(i+1),keys.get(i));
}
}


# Conclusion

If you notice you’re copying and pasting unit test code for common requirements, consider one of these two techniques. However, these techniques are only guides because you will encounter cases that do not fit cleanly. As an exercise for the reader: how would you extend these techniques to apply to java.util.WeakHashMap? If you want to have a go I’ve made the source available on my github.

Tagged with: , ,

## Dining Philosophers Problem

Posted in Computer Science, Java, Job Hunt, Programming by Joseph on February 9, 2016

The book I’m going through, Programming Interviews Exposed, presents the Dining Philosophers Problem and their solution to the problem. Firstly, their explanation for their solution is weak because they do not describe the underlying principle they applied to prevent deadlocks. Secondly, their implementation of their solution still results in starvation. Finally, with a few modifications, I’ll go over my improved solution and compare the results of the two.

### The Problem

This question comes from the Programming Interviews Exposed book:

Five introspective and introverted philosophers are sitting at a circular table. In front of each philosopher is a plate of food. A fork (or a chopstick) lies between each philosopher, one by the philosopher’s left hand and one by the right hand. A philosopher cannot eat until he or she has both forks in hand. Forks are picked up one at a time. If a fork is unavailable, the philosopher simply waits for the fork to be freed. When a philosopher has two forks, he or she eats a few bites and then returns both forks to the table. If a philosopher cannot obtain both forks for a long time, he or she will starve. Is there an algorithm that will ensure that no philosophers starve?

### Applying an Order to Resources

In Programming Interviews Exposed, the authors recommend that all except one philosopher grabs the left fork, then the right. They mention this prevents the dead lock with no explanation or how someone would come up with this solution.

Firstly, you get deadlocks where there is a cycle in your resource allocation graph. In these graphs, you have directed edges from resources to threads when that thread has a lock on that resource. You have an edge from a thread to a resource when that thread is waiting for that resource. A cycle in this graph describes a scenario where no one can make progress because everyone in the cycle is waiting on each other to finish.

Chopstick1 -> Philospher1 -> C2 -> P2 -> C3 -> P3 -> C4 -> P4 -> C5 -> P5
^                                                                  /
\________________________________________________________________/

One way to prevent deadlocks, is to ensure that you never have cycles. If you can create an order to describe your resources and if all threads acquire the resources in that order, then you will never have deadlocks.

To solve the dining philosophers problem, you can assign a number to each fork, giving it an order. You always grab the lowest fork first. For instance philosopher 5 needs forks 1 and 5. Since 1 is the lowest, philosopher 5 will try to grab this fork first. This is how you arrive at the solution where all except one philosopher graphs the left first.

Another real world problem you can see this resource ordering technique applied to is bank accounts. Imagine you’re implementing a transfer from a withdrawing account into the depositing account. The naive way would be to first lock the withdrawing account, then lock the depositing account. However, this generates a cycle when the two accounts are trying to transfer to each other at the same time. First both transfers lock the withdrawing account, locking both accounts. Now both transfers try to lock the depositing accounts and waits forever because both accounts are already locked. Instead of locking the withdrawing account, then the depositing account, first lock the account with the lower ID. This will prevent all possible deadlocks.

### Barging

The idea behind Programming Interviews Exposed’s solution is correct as explained in the previous section, but their implementation is not. At a high level, their solution involves monitors and looks like this:

    synchronized(lowerLock) {
synchronized(higherLock) {
// philosopher eats the food
}
}


Unfortunately, Java monitors do not have a queue for threads waiting for a monitor. As a result, any thread waiting for a monitor even the most recently added one can be the next one allowed into the monitor. This is contrary to a programmer’s intuition as we expect the oldest waiter to be allowed in next. Not allowing the longest waiting thread is called barging. When the lives of philosophers are at stake, you want to prevent barging because it allows starvation.

To illustrate, I implemented the books solution of the problem in Java and recorded how many times each philosopher got to eat, with various eating durations:

When philosopher takes no time to eat:
count   philosopher_id
1 0
104 1
390 3
When philosopher takes 100 ms to eat:
count   philosopher_id
8 0
28 1
76 2
371 3
12 4
When philosopher takes 1 second to eat:
count   philosopher_id
16 0
48 1
74 2
337 3
20 4

From these results, we can see that philosopher 3 gets to eat an unfair amount of time and that when it takes no time to eat, some philosophers never get to eat.

### Preventing Barging

The easiest way to prevent barging in Java is to use fair locks. ReentrantLock’s constructor accepts a boolean fairness parameter. By setting this to true, it fulfills the .lock() calls in the order that they come, unlike a monitor. The resulting code would look something like:

    ReentrantLock [] locks = new ReentrantLock[5];
for(int i = 0 ; i < locks.length; i++) {
locks[i] = new ReentrantLock(true); // make it fair
}
// ...
// ...

lowerLock.lock();
try {
higherLock.lock();
try {
// philosopher eats the food
} finally {
higherLock.unlock();
}
} finally {
lowerLock.unlock();
}


I re-ran the problem with the fair locks and recorded how many times each philosopher ate:

When philosopher takes no time to eat
count   philosopher_id
108 0
38 1
48 2
263 3
38 4
When philosopher takes 100ms to eat
count   philosopher_id
85 0
95 1
99 2
131 3
85 4
When philosopher takes 1 second to eat
count   philosopher_id
89 0
91 1
100 2
127 3
88 4

From these results, we see that in all scenarios, no philosopher starved. Secondly, we see a more even distribution of eating compared to the previous barging solution. Unfortunately we are unable to eliminate all the unfairness as the resource ordering we used to solve the problem gives higher priority to the second highest philosopher resulting in an inherent unfairness. From this example we can see that if you have a scenario where you need to prevent starvation, make sure to use fair locks instead of monitors.

## How To Avoid Busy Waiting

Posted in C++, Computer Science, Java, Job Hunt, Programming by Joseph on February 4, 2016

Programming Interviews Exposed asks us to describe busy waiting and how to avoid it. The book provides a decent solution, but not the best solution. I’ll first introduce the problem and then build upon the solution until we have our best solution.

## Busy Waiting

You have busy waiting When one thread, waits for a result from another thread and you use and NOOP/empty loop to wait for that result. We will explore waiting for a result from two sub tasks to demonstrate and solve this problem.

    Thread thread1 = new Thread(new Subtask1Runnable());
// empty loop
}
// do something with result from thread1 and thread2


You want to avoid this pattern because it wastes CPU cycles:

1. Other threads you have running could of made use of those cycles
2. Other processes sharing the same machine could have also used those cycles
3. If the machine only has one CPU, thread2 could have made use of those cycles

## Poor Solution: Sleep

So the issue is that we’re wasting CPU cycles by repeatedly checking the condition. So what if we made the thread sleep for a bit.

    while( waitinigForThread1() && waitingForThread2() ) {
Thread.sleep(1000); // Sleep for 1 second
}


This is an improvement over busy waiting as we’re now wasting less cycles. However
there are still some draw backs to this pattern:

1. You’re still wasting cycles. It may seem like you’re wasting one cycle every
second. However, every time the thread wakes up it has to reload the thread’s stack, registers, and it has to recompute if the other thread is done. There’s still significant wastage.
2. How do you set the sleep duration?
1. Set it too long, and there is a long delay between the threads finishing and
the main thread recognizing that it’s done
2. Set it too short, and you increase the amount of wasted cycles

## Better But Still Poor Solution: Exponential Backoff

To take away a bit of the tuning, we can add an exponential backoff. To describe the process, imagine that your plane gets delayed for some unknown period of time. What do you do? Well first you might go to the washroom for a few minutes. Then you might get something to eat for half an hour. Then you might play a game on your cellphone for an hour. Finally, you’ll read a book for the next couple of hours. Your waiting tasks become progressively larger as you wait more. That’s the idea behind exponential backoff. You sleep for progressively longer durations as you wait for longer periods of time.

    private static final int MAX_BACKOFF=1000*60*60; // 1 hour
private static final int INITIAL_BACKOFF=100; // 100ms
int backOff=INITIAL_BACKOFF;
Thread.sleep(backOff); // Sleep for 1 second
backOff=backOff<<2; //bit shift to the left to multiply by 2
if(backOff > MAX_BACKOFF) {
backOff = MAX_BACKOFF;
}
}


Now this solution isn’t the best possible solution yet for a couple of reasons:

1. Ideally we want the the main thread to wake up only once or twice
1. Once when the first sub task completes
2. A second time when the second sub tasks completes
2. We still have to play with the starting back off, and max back off parameters to get decent performance

## Decent Solution: Locking or Monitors (Based on Programming Interviews Exposed)

As I hinted in the previous section, is there a way you can just wake up the sleeping thread once each task finishes? This is the solution that Programming Interview Exposed proposes. You can setup up some a lock and have the main thread wait on the lock. Then the two threads ‘notify’ when they have finished. Each ‘notify’ wakes the main thread and the main thread check to see if both threads finished.

This method uses the least possible cycles, but it is error prone. If not coded properly, you can end up with race conditions or deadlocks. Consider our two sub task example:

    Object lock = new Object(); // shared object to synchronize on
int doneCount = 0;
public void run() {
// ...
// do sub task 1
// ...
synchronized(lock) {
doneCount++;
notify();
}
}
});
public void run() {
// ...
// do sub task 2
// ...
synchronized(lock) {
doneCount++;
notify();
}
}
});

// this is the main thread again
synchronized(lock) {

while(doneCount < 2) {
wait();
}
}


This solution is great as the main thread only wakes up twice. Once when sub task 1 finishes, and a second time when sub task 2 finishes.

However, in this example we can mess up in many ways:

1. If I started the threads outside the synchronized block, the two threads
could finish before the main thread calls ‘wait()’. This would leave the main thread sleeping forever and no future notifies are called.
2. If I put the count outside of the synchronized block, the two threads might
update them at the same time and I’ll get doneCount==1 instead of 2!
3. If I forget to call notify() on one of the threads, then when the main thread
might never wake up because it didn’t get one of the notifications.
4. Placing the entire sub task in the synchronized block instead of just the
shared data accesses would of only let one sub task run at a time resulting in
no parallelism.

As a result, I do not recommend the solution presented in Programming Interviews Exposed.

## Good Solution: Futures and Other Concurrency Libraries

Instead of trying to carefully code this pattern every time, you can use libraries that have already correctly implemented these patterns. In our example where we’re waiting for two results from two threads so we can apply Futures.

    ExecutorService executor = ...
Future<String> future1 = executor.submit( new Callable<String>() {
public String call() {
/* do work for first result here */
}
});
Future<String> future2 = executor.submit( new Callable<String>() {
public String call() {
/* do work for second result here */
}
} );
String result1 = future1.get();
String result2 = future2.get();


From the example we can immediately see that it’s the most elegant solution, and less bug prone. But more importantly, we still use the least possible cycles as execution of the main thread only resumes when each of the futures complete.

Here we were looking at an example waiting on two sub tasks to complete. However, there are many different waiting scenarios and the standard libraries of C# and Java provide many concurrency facilities to make developers’ lives easier. Before trying to code up your own monitor or locking, see if you problem has been solved before.

## Summary

If you want to avoid busy waiting you can apply the write concurrent library for your problem. If the library doesn’t exist, you can try using monitors and locks, but be warned that there are many opportunities to mess up. Finally, don’t use busy loops or sleep to wait on tasks.

## Removing Characters From a String

Posted in Uncategorized by Joseph on February 1, 2016

Programming Interviews Exposed asks us to remove characters from an input string, looking for the characters in a second string. I am unhappy with solution they provide, particularly because of the reasons they use to invalidate what I believe to the better solution. You can check out my solution to the problem on my github repository.

In this problem we are given the following C# signature:

string removeChars( string str, string remove );


‘str’ is the original string we are removing characters from. ‘remove’ is the string that contains that characters that we need to remove from ‘str’. For example if
str=”hello world” and
remove=” wo”
the result should be “hellwrld” with all spaces, w’s, and o’s removed.

### Synchronized

My first issue, is they recommend not using the StringBuffer for the wrong reason. The reason you do not want to use StringBuffer, is because it’s synchronized in both Java and C#. Below is the Java source code from grepcode:

    public synchronized StringBuffer append(char c) {
super.append(c);
return this;
}


The methods are synchronized where necessary so that all the operations on any particular instance behave as if they occur in some serial order that is consistent with the order of the method calls made by each of the individual threads involved.”

Since the buffer does not exceed the scope of our method, and we know that only one thread is modifying the buffer, we don’t need the expensive overhead of protecting against concurrent access. We can go with the cheaper solution of using a StringBuilder (in both Java and C#).

### Extra Memory

Here the authors recommend building the string in the original character array and using two pointers to the array. One for the next non-deleted character and a second for where the next character is coming from. Ideally this is an excellent solution, if implementing correctly. I have annotated their solution with the issues.

    // str has N characters
// remove has M characters
string removeChars( string str, string remove ){
//NOTE: We use N extra memory because .toCharArray() copies the
//array in string because strings are immutable
char[] s = str.toCharArray();
//Note: Again, we use M extra memory because .toCharArray copies
char[] r = remove.toCharArray();
bool[] flags = new bool[128]; // assumes ASCII!
int len = s.Length;
int src, dst;
// Set flags for characters to be removed
for( src = 0; src &amp;amp;lt; len; ++src ){
flags[r[src]] = true;
}
src = 0;
dst = 0;
// Now loop through all the characters,
// copying only if they aren’t flagged
while( src &amp;amp;lt; len ){
if( !flags[ (int)s[src] ] ){
s[dst++] = s[src];
}
++src;
}
// NOTE: Again we use N extra memory because the string constrcutor
// copies the characters of s, into the new string
return new string( s, 0, dst );
}


Because Strings are immutable in Java and C#, toCharArray() in Java and toCharArray() in C# both have to allocate a new character array to prevent the caller from modifying the internal character array. Secondly you cannot force the Java String(char[], int, int) or the C# string(char[],int,int) [String->StringNative->StringObject] methods to use your char[] array. It has to copy it because the caller has a reference to the original character array and potentially modify it.

Overall, we used 2N + M memory, when the best possible solution in Java or C# uses 2N extra memory. We need N for the buffer and N for returning the result as a String. So we might as well have kept a StringBuilder anyways as it tremendously simplifies the code and uses potentially less extra memory because potentially a majority of the characters in the original string might be removed. I think this is an excellent example of what Knuth meant in “premature optimization is the root of all evil” because in this solution we didn’t make it more efficient and we made it more complex.

If you really wanted to use no extra memory, this should have been implemented in C/C++ with the following signature:

    void removeChars( char * str, char * rem );


Keeping the same code as above, but modifying str instead, allows us to incur no extra memory by modifying the contents of char*str in place.

### Summary

Given the solution provided in the book and the limitations of immutable Strings in Java and C#, we might as well have used a StringBuilder, incurring potentially less excess memory and a simpler solution. StringBuffer should not be used as it’s synchronized and we’re not using Threads. Finally, we were really wanted to achieve the author’s intended solution, we need to use C/C++ and change the method signature.

## Preparing For Interviews

Posted in Job Hunt, Programming by Joseph on January 29, 2016

I have completed my Master’s Thesis and all that’s left is paper work. As a result, it’s time to prepare for interviews. I’m going through ‘Programming Interviews Exposed‘ as practice. You can follow my progress at my github repository. Wish me luck in the interview process! I will occasionally make posts if I find something interesting, or can provide deeper insight than the book.

So far I have learned the following things, unrelated to the book, by going through the book:

• How to properly use github
• For the longest time, I’ve been committing to the master branch instead of merging into master from a separate branch
• How difficult it is to handle Unicode characters in C++
• How to use Gradle for your java builds of small projects
• I’ve found Gradle far more convenient that using ant/maven and ivy
• How to do unit testing in C++

Also, by implementing the solutions I’ve been able to practice many things which I would not get exposure to by working through the problems in my head.

Firstly, I reviewed my C/C++ knowledge which hasn’t been used intensely since undergrad. I’m now comfortable with things like manipulating pointers, differences between Java and C++, and the standard library. One interesting caveat is that when you want to used a HashTable data structure, do not used a map<>. That’s actually map backed by tree data structure and some interviews will notice this. You’ll want to use unordered_map<>, which was added in the C++11 standard.

Secondly, by implementing and testing the solution, you discover edge cases you might not have considered by working the problem through your head. I hit many more edge cases than I expected to with my heap implementation.

Finally, during interviews you will be asked to code up the solution. You do not want to waste your precious time recalling C++ or Java libraries. You want to be able to focus as much of your energy on tackling the problem, instead of tackling the language.

Wish me luck!

Tagged with: , , , ,

## Reading Books and Technical Papers on the Kindle (Method 2)

Posted in Uncategorized by Joseph on April 27, 2014

I’m trying to read a book similar to Elements of Statistical Learning (aside: the book is freely available by one of the authors, Professor Rob Tibshirani, here) on my Kindle, but the method I outlined in my previous post does not work that well. Moreover, look at all the wasted space on the left and right margins!

No matter what settings you apply (rotating and/or zooming), I have this margin that’s driving me nuts. Fortunately, after googling for PDF cropper, I came across briss. Briss is a PDF cropper that automatically crops the margins, and is able to automatically detect exceptions like left and right pages, or even if images are overflowing. Below is a screenshot from running briss on Elements of Statistical Learning.

Briss in this case detected the title page should be handled separately. Additionally, it detected that there is separate formatting for the margins of the even and odd pages.

Now the final step is getting it to display on this kindle. Unfortunately, in portrait mode there is still a margin because the Kindle tries to display the entire page.

As a result, you need to switch to landscape mode by going to ‘Zoom & Contrast’

Viola! No more wasted space. Now you have no excuse not to read those technical books you’ve been putting off the last couple of months!

Tagged with: , , ,

## Reading Technical Papers on the Kindle

Posted in Uncategorized by Joseph on January 7, 2014

## Introduction

I have a Kindle with a 6” screen and reading technical papers is difficult. However, I am fortunate that most of the papers I read follow the 2-column format. As a result, there is a technique available to make reading technical papers less painful. I have only tested this with my Kindle. I will be demonstrating the method using the LinkBench technical paper from SIGMOD. These instructions may also work for the Kindle Paperwhite as well. Please leave a comment letting me know if it does.

## The Quirks of the Kindle

This technique involves, changing the the orientation, the zoom, and properly using the arrow keys to navigate the page. You can use it without making any modifications to the Kindle or the document you want to read.

6. ### To go to the second column, do not use the right arrow key

Intuitively, you would go to the right side in order to view the second column by pressing the right arrow two times. However, on the second press, the Kindle is not smart enough to realize it viewing beyond the bounds of the document. As a result, it repositions as if the document kept going to the right.

7. ### To go to the second column, use the left arrow key

Here we are playing a trick on the Kindle to do the right thing. It wraps around and properly displays the second column. To go back to the first column we need to use the right arrow key.

## Conclusions

Reading two column technical documents on the Kindle is possible. It’s just not very intuitive to figure out on your own. Please let me know what Kindles and e-readers this trick works for.

## References

Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and Mark Callaghan. 2013. LinkBench: a database benchmark based on the Facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 1185-1196. DOI=10.1145/2463676.2465296 http://doi.acm.org/10.1145/2463676.2465296 (Available here)

Tagged with: , ,

## An Eclipse and Hadoop Pitfall

Posted in Hadoop by Joseph on January 6, 2014

For one of my course projects I inherited a project that was implementing a dimensionality reduction technique in map reduce. The previous owner of the project exclusively ran it using eclipse and used Mahout in Action‘s examples as a starting point.

## The Problem I Inherited

The following pattern was littered through most of the Hadoop jobs:

public class Main extends AbstractJob {

/**
* Some parameter to be used by the Mapper and
* Reducer.
*/
private static int parameter;

public int run( String[] args ) throws Exception {
parameter = 100;
Job brokenJob = new Job( conf, "This doesn't work!" );
brokenJob.setMapperClass( BrokenMapper.class );
brokenJob.setReducerClass( BrokenReducer.class );
brokenJob.setInputFormatClass( TextInputFormat.class );
brokenJob.setOutputFormatClass( TextOutputFormat.class );
brokenJob.setOutputKeyClass( IntWritable.class);
brokenJob.setOutputValueClass( IntWritable.class );
brokenJob.setMapOutputKeyClass( IntWritable.class );
brokenJob.setMapOutputValueClass( IntWritable.class );
}

public static void main( String[] args ) throws Exception {
ToolRunner.run( new Main(), args );
}

/**
* This mapper is broken because it makes reference to a static
* variable set by the JobClient, on a different JVM.
*/
public static class BrokenMapper
extends Mapper < LongWritable, Text, IntWritable, IntWritable > {

protected void map( LongWritable key, Text value, Context context )
throws IOException, InterruptedException {
/**
* Do something that makes use of static variable 'paramater'
*/
}
}

/**
* This reducer has the same problem as the above mapper.
*/
public static class BrokenReducer
extends Reducer< IntWritable, IntWritable, IntWritable, IntWritable > {

public void reduce( IntWritable key, Iterable< VectorWritable > values, Context context )
throws IOException, InterruptedException {
/**
* Do something that makes use of static variable 'paramater'
*/
}
}
}


The above code works when you run it from Eclipse. It works because using Maven, Eclipse has all of the jars that it needs in order to run. However, it does not run it using hadoop -libjar. It’s running it using java -classpath. As a result, everything defaults to the LocalJobRunner. The significance of this is that everything is running on a single JVM. As a result, the variables set by the JobClient were visible to the Mapper and Reducer. In a real Hadoop job, this is not the case. The JobClient, Mapper, and Reducers run on separate JVMs, and most likely different machines. When you run the code above, you’ll get NullPointerExceptions or weird behaviour resulting from the default values of java primitives. This results in developers unfamiliar with the MapReduce paradigm to make invalid assumptions.

## How I Fixed It

One easy fix is to pass variables through the job configuration. Then in the setup of the Mapper and Reducer, parse the configuration. The code is given below. I have omitted a lot of code from the above example that would be redundant.

public static class FixedMapper
extends Mapper < LongWritable, Text, IntWritable, IntWritable > {

private int parameter;

protected void setup(Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
// we do not use Confugration.getInt(String) because we want
// to fail if the value is not present.
parameter = Integer.parseInt(conf.get("parameter"));
}

protected void map( LongWritable key, Text value, Context context )
throws IOException, InterruptedException {
/**
* Do something that makes use of static variable 'paramater'
*/
}
}


Unfortunately, this was not enough. Some of the variables passed between the mappers and reducers were matrices. They were on the order of 100s of megabytes in size. It would be a bad idea to place something so large in Hadoop job’s configuration. As a result, I saved it to the distributed cache, and loaded it in the mapper’s and reducer’s setup method.

## Conclusion

When you’re using Eclipse, understand what it is hiding. In this case, Eclipse was hiding that it uses java -classpath to run the main class from Maven. Secondly, it was also hidden that running the main class using java instead of hadoop uses the LocalJobRunner by default. By doing so, everything runs in a single JVM and you will fail to observe where you’re exploiting that false assumption.

Additionally, understand the framework you’re using is valuable is valuable in avoiding similar pitfalls. In this case, there was a failure in understanding the MapReduce paradigm. The JobClient cannot shared variables with Mapper and Reducer because they run on separate JVMs. This is not an argument against Eclipse. This merely a warning. I use Eclipse for all of my Java development. However, I have compiled and run java programs from the command line. This pattern extends to any tool you’re using. For example, what are Ant and Maven hiding from you to make your life easier?

Tagged with: , , ,

## Showing Mathematically That More Complexity Increases Test Error

Posted in Machine Learning by Joseph on October 26, 2013

People who understand machine learning know from experience that having to complex a model causes poor performance when you apply your models to the real world because it does not generalize. However, where does the mathematical intuition for this come from? Up until I saw today’s lecture, I only had a sense of this issue because of all the experiments I ran. Today our professor Dr. Ali Ghodsi presented a mathematical proof for a radial basis function network decreasing in test performance as a result of having too many hidden nodes. In the follow post I give the necessary background on RBFs and a summary of the proof for overfitting due to too many hidden nodes.

## Background – What is a Radial Basis Function Network (RBF)?

It’s easy to imagine RBF as a neural network with a few differences.

1. There is only one hidden layer.
2. There are only weights going from the input nodes to the hidden layer.
3. The activation function applied on the hidden layer is the normal distribution.
4. You cluster your data and take their mean and variance to determine the parameters for the normal distributions used in the activation function above.

## Result

Details of the proof are given in Professor Ghodsi’s and Dale Schuurmans’s paper: Ali Ghodsi, Dale Schuurmans, Automatic basis selection techniques for RBF networks, Neural Networks, Volume 16, Issues 5–6, June–July 2003, Pages 809–816.

By assuming that the noise in the training and test data sets are guassian, the paper proves:
$\Large Error(M) = error(M) - N\sigma^2 + 2\sigma(M+1)$
Where
M is the number of hidden nodes
N is the number of data points in the training set
σ is the variance of the gaussian noise assumed to be in the data set
Error(M) is the Test error as a function of the number of hidden units
error(M) is the training error as a function of the number of hidden units

## Conclusion

Hopefully now you have some mathematically intuition for why having too complex a model causes poor performance on new examples. Unfortunately, the derivation in the paper does not extend beyond RBF networks. However, the tools used in the proof might be applicable to other machine learning algorithms.

## Beginning First Term of Master’s Degree

Posted in Computer Science by Joseph on September 20, 2013

My research focus is Databases and Distributed Systems. My thesis is still undecided. One thing I have figure out though are my courses.

## CS848: Large Scale Data Management

Firstly, I’m taking CS848: Large Scale Data Management and it fits well in my interest of Distributed Systems and Database. As part of this class we read three papers a week; write 5 paper reviews; give one presentation on a paper and lastly put together a course project.

The papers we will be reading for the course are available below. The topics range from batch systems like HDFS/GFS and MapReduce to datastores that provide weak isolation levels like Dynamo and Big Table to ACID databases that scale (Spanner) to graph databases.
https://cs.uwaterloo.ca/~kdaudjee/courses/cs848/schedule.shtml

The paper I have chosen to present is Jan Schaffner, Tim Januschowski, Megan Kercher, Tim Kraska, Hasso Plattner, Michael J. Franklin, Dean Jacobs
RTP: Robust Tenant Placement for Elastic In-Memory Database Clusters SIGMOD, New York, USA, 2013

Stay tuned for some reviews, presentation slides, and anything else I’ve done related to the class.

## STAT841: Classification

Secondly, I’m taking STAT841: Classification. Here we learn the derivations of common Machine Learning classification techniques like Fisher’s Discriminant Analysis, Logistic Regression, Neural Networks, and Support Vector Machines. I’ll be posting any results from lectures that I find interesting. I’m taking this course even though my focus is in Distributed Systems and Databases because having a strong foundation in Machine Learning will hopefully allow me to apply Machine Learning techniques to Distributed System problems. Additionally, I have a strong interest in Machine Learning.