Joseph Mate's Blog

Reading Books and Technical Papers on the Kindle (Method 2)

Posted in Uncategorized by Joseph on April 27, 2014

I’m trying to read a book similar to Elements of Statistical Learning (aside: the book is freely available by one of the authors, Professor Rob Tibshirani, here) on my Kindle, but the method I outlined in my previous post does not work that well. Moreover, look at all the wasted space on the left and right margins!

No matter what settings you apply (rotating and/or zooming), I have this margin that’s driving me nuts. Fortunately, after googling for PDF cropper, I came across briss. Briss is a PDF cropper that automatically crops the margins, and is able to automatically detect exceptions like left and right pages, or even if images are overflowing. Below is a screenshot from running briss on Elements of Statistical Learning.

Briss in this case detected the title page should be handled separately. Additionally, it detected that there is separate formatting for the margins of the even and odd pages.

Now the final step is getting it to display on this kindle. Unfortunately, in portrait mode there is still a margin because the Kindle tries to display the entire page.

As a result, you need to switch to landscape mode by going to ‘Zoom & Contrast’

Viola! No more wasted space. Now you have no excuse not to read those technical books you’ve been putting off the last couple of months!

Tagged with: , , ,

Reading Technical Papers on the Kindle

Posted in Uncategorized by Joseph on January 7, 2014

Introduction

I have a Kindle with a 6” screen and reading technical papers is difficult. However, I am fortunate that most of the papers I read follow the 2-column format. As a result, there is a technique available to make reading technical papers less painful. I have only tested this with my Kindle. I will be demonstrating the method using the LinkBench technical paper from SIGMOD. These instructions may also work for the Kindle Paperwhite as well. Please leave a comment letting me know if it does.

The Quirks of the Kindle

This technique involves, changing the the orientation, the zoom, and properly using the arrow keys to navigate the page. You can use it without making any modifications to the Kindle or the document you want to read.

6. To go to the second column, do not use the right arrow key

Intuitively, you would go to the right side in order to view the second column by pressing the right arrow two times. However, on the second press, the Kindle is not smart enough to realize it viewing beyond the bounds of the document. As a result, it repositions as if the document kept going to the right.

7. To go to the second column, use the left arrow key

Here we are playing a trick on the Kindle to do the right thing. It wraps around and properly displays the second column. To go back to the first column we need to use the right arrow key.

Conclusions

Reading two column technical documents on the Kindle is possible. It’s just not very intuitive to figure out on your own. Please let me know what Kindles and e-readers this trick works for.

References

Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and Mark Callaghan. 2013. LinkBench: a database benchmark based on the Facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 1185-1196. DOI=10.1145/2463676.2465296 http://doi.acm.org/10.1145/2463676.2465296 (Available here)

Tagged with: , ,

Posted in Hadoop by Joseph on January 6, 2014

For one of my course projects I inherited a project that was implementing a dimensionality reduction technique in map reduce. The previous owner of the project exclusively ran it using eclipse and used Mahout in Action‘s examples as a starting point.

The Problem I Inherited

The following pattern was littered through most of the Hadoop jobs:

public class Main extends AbstractJob {

/**
* Some parameter to be used by the Mapper and
* Reducer.
*/
private static int parameter;

public int run( String[] args ) throws Exception {
parameter = 100;
Job brokenJob = new Job( conf, "This doesn't work!" );
brokenJob.setMapperClass( BrokenMapper.class );
brokenJob.setReducerClass( BrokenReducer.class );
brokenJob.setInputFormatClass( TextInputFormat.class );
brokenJob.setOutputFormatClass( TextOutputFormat.class );
brokenJob.setOutputKeyClass( IntWritable.class);
brokenJob.setOutputValueClass( IntWritable.class );
brokenJob.setMapOutputKeyClass( IntWritable.class );
brokenJob.setMapOutputValueClass( IntWritable.class );
}

public static void main( String[] args ) throws Exception {
ToolRunner.run( new Main(), args );
}

/**
* This mapper is broken because it makes reference to a static
* variable set by the JobClient, on a different JVM.
*/
public static class BrokenMapper
extends Mapper < LongWritable, Text, IntWritable, IntWritable > {

protected void map( LongWritable key, Text value, Context context )
throws IOException, InterruptedException {
/**
* Do something that makes use of static variable 'paramater'
*/
}
}

/**
* This reducer has the same problem as the above mapper.
*/
public static class BrokenReducer
extends Reducer< IntWritable, IntWritable, IntWritable, IntWritable > {

public void reduce( IntWritable key, Iterable< VectorWritable > values, Context context )
throws IOException, InterruptedException {
/**
* Do something that makes use of static variable 'paramater'
*/
}
}
}


The above code works when you run it from Eclipse. It works because using Maven, Eclipse has all of the jars that it needs in order to run. However, it does not run it using hadoop -libjar. It’s running it using java -classpath. As a result, everything defaults to the LocalJobRunner. The significance of this is that everything is running on a single JVM. As a result, the variables set by the JobClient were visible to the Mapper and Reducer. In a real Hadoop job, this is not the case. The JobClient, Mapper, and Reducers run on separate JVMs, and most likely different machines. When you run the code above, you’ll get NullPointerExceptions or weird behaviour resulting from the default values of java primitives. This results in developers unfamiliar with the MapReduce paradigm to make invalid assumptions.

How I Fixed It

One easy fix is to pass variables through the job configuration. Then in the setup of the Mapper and Reducer, parse the configuration. The code is given below. I have omitted a lot of code from the above example that would be redundant.

public static class FixedMapper
extends Mapper < LongWritable, Text, IntWritable, IntWritable > {

private int parameter;

protected void setup(Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
// we do not use Confugration.getInt(String) because we want
// to fail if the value is not present.
parameter = Integer.parseInt(conf.get("parameter"));
}

protected void map( LongWritable key, Text value, Context context )
throws IOException, InterruptedException {
/**
* Do something that makes use of static variable 'paramater'
*/
}
}


Unfortunately, this was not enough. Some of the variables passed between the mappers and reducers were matrices. They were on the order of 100s of megabytes in size. It would be a bad idea to place something so large in Hadoop job’s configuration. As a result, I saved it to the distributed cache, and loaded it in the mapper’s and reducer’s setup method.

Conclusion

When you’re using Eclipse, understand what it is hiding. In this case, Eclipse was hiding that it uses java -classpath to run the main class from Maven. Secondly, it was also hidden that running the main class using java instead of hadoop uses the LocalJobRunner by default. By doing so, everything runs in a single JVM and you will fail to observe where you’re exploiting that false assumption.

Additionally, understand the framework you’re using is valuable is valuable in avoiding similar pitfalls. In this case, there was a failure in understanding the MapReduce paradigm. The JobClient cannot shared variables with Mapper and Reducer because they run on separate JVMs. This is not an argument against Eclipse. This merely a warning. I use Eclipse for all of my Java development. However, I have compiled and run java programs from the command line. This pattern extends to any tool you’re using. For example, what are Ant and Maven hiding from you to make your life easier?

Tagged with: , , ,

Showing Mathematically That More Complexity Increases Test Error

Posted in Machine Learning by Joseph on October 26, 2013

People who understand machine learning know from experience that having to complex a model causes poor performance when you apply your models to the real world because it does not generalize. However, where does the mathematical intuition for this come from? Up until I saw today’s lecture, I only had a sense of this issue because of all the experiments I ran. Today our professor Dr. Ali Ghodsi presented a mathematical proof for a radial basis function network decreasing in test performance as a result of having too many hidden nodes. In the follow post I give the necessary background on RBFs and a summary of the proof for overfitting due to too many hidden nodes.

Background – What is a Radial Basis Function Network (RBF)?

It’s easy to imagine RBF as a neural network with a few differences.

1. There is only one hidden layer.
2. There are only weights going from the input nodes to the hidden layer.
3. The activation function applied on the hidden layer is the normal distribution.
4. You cluster your data and take their mean and variance to determine the parameters for the normal distributions used in the activation function above.

Result

Details of the proof are given in Professor Ghodsi’s and Dale Schuurmans’s paper: Ali Ghodsi, Dale Schuurmans, Automatic basis selection techniques for RBF networks, Neural Networks, Volume 16, Issues 5–6, June–July 2003, Pages 809–816.

By assuming that the noise in the training and test data sets are guassian, the paper proves:
$\Large Error(M) = error(M) - N\sigma^2 + 2\sigma(M+1)$
Where
M is the number of hidden nodes
N is the number of data points in the training set
σ is the variance of the gaussian noise assumed to be in the data set
Error(M) is the Test error as a function of the number of hidden units
error(M) is the training error as a function of the number of hidden units

Conclusion

Hopefully now you have some mathematically intuition for why having too complex a model causes poor performance on new examples. Unfortunately, the derivation in the paper does not extend beyond RBF networks. However, the tools used in the proof might be applicable to other machine learning algorithms.

Beginning First Term of Master’s Degree

Posted in Computer Science by Joseph on September 20, 2013

My research focus is Databases and Distributed Systems. My thesis is still undecided. One thing I have figure out though are my courses.

CS848: Large Scale Data Management

Firstly, I’m taking CS848: Large Scale Data Management and it fits well in my interest of Distributed Systems and Database. As part of this class we read three papers a week; write 5 paper reviews; give one presentation on a paper and lastly put together a course project.

The papers we will be reading for the course are available below. The topics range from batch systems like HDFS/GFS and MapReduce to datastores that provide weak isolation levels like Dynamo and Big Table to ACID databases that scale (Spanner) to graph databases.

https://cs.uwaterloo.ca/~kdaudjee/courses/cs848/schedule.shtml

The paper I have chosen to present is Jan Schaffner, Tim Januschowski, Megan Kercher, Tim Kraska, Hasso Plattner, Michael J. Franklin, Dean Jacobs
RTP: Robust Tenant Placement for Elastic In-Memory Database Clusters SIGMOD, New York, USA, 2013

Stay tuned for some reviews, presentation slides, and anything else I’ve done related to the class.

STAT841: Classification

Secondly, I’m taking STAT841: Classification. Here we learn the derivations of common Machine Learning classification techniques like Fisher’s Discriminant Analysis, Logistic Regression, Neural Networks, and Support Vector Machines. I’ll be posting any results from lectures that I find interesting. I’m taking this course even though my focus is in Distributed Systems and Databases because having a strong foundation in Machine Learning will hopefully allow me to apply Machine Learning techniques to Distributed System problems. Additionally, I have a strong interest in Machine Learning.

Determining If A Boolean Expression Can Be False

Posted in Uncategorized by Joseph on February 6, 2013

My first hunch was that determining if a boolean expression can be false was NP complete. After trying to prove it, I came up for a proof that it’s actually in P:

1) We have any boolean expression S
2) Using a well documented algorithm, we convert it to Conjunctive Normal Form
3) Formula becomes (x1,1 v x1,2 v … v x1,n) Λ (x2,1 v x2,2 v … v x2,n) Λ … Λ (xn,1 v xn,2 v … v xn,n)
4) Since we are trying to solve a setting of x1,1 to xn,n that results in false, that is the same thing as solving for a setting that results in true in the inverted expression
5) Now we have ¬((x1,1 v x1,2 v … v x1,n) Λ (x2,1 v x2,2 v … v x2,n) Λ … Λ (xn,1 v xn,2 v … v xn,n))
6) Applying DeMorgan’s ¬(x1,1 v x1,2 v … v x1,n) v ¬ (x2,1 v x2,2 v … v x2,n) v … v ¬(xn,1 v xn,2 v … v xn,n))
7) Applying DeMorgan’s again (¬x1,1 Λ ¬x1,2 Λ … Λ ¬x1,n) v (¬x2,1 Λ ¬x2,2 Λ … Λ ¬x2,n) v … v (¬xn,1 Λ ¬xn,2 Λ … Λ ¬xn,n))
8) Now we can determine a setting that results in true, by using the ANDed group that doesn’t contain a contradiction. If all the ANDed groups, contain contradictions, then there is no setting that results in true.
9) This setting from step 8 that results in true for ¬S will result in false for S

This came up at work because we were trying to determine if a boolean expression could become false. Knowing this would allow us to make a decision on how to handle the query.

If I have made an error, let me know! If you’re confused about one of the steps, feel free to ask.

Cheers,
Joseph

Tagged with: ,

Pitfalls of Integer.hashCode() and Long.hashCode() With Partitioning

Posted in Java by Joseph on April 16, 2012

Problem

So we want to partition our data into nice even chunks. If you use Integer.hashCode(), Long.hashCode(), or modding by the number of your partitions then the size of each partition will depend on the distribution of your data and the number of partitions. Here’s a quick example that I wrote up to demonstrate.

Here we have the partition size as 8 and all the numbers are even. I place each number into a bucket by taking the modulus of it. Here are the results:

import java.util.Random;

public class HashCodeIssue {

private static Random random;

public static void main(String [] args) {
int partitions = Integer.parseInt( args[0] );
long seed = Integer.parseInt( args[1] );
random = new Random(seed);
int [] buckets = new int[partitions];

for(int c = 0; c < 100000; c++) {
buckets[hash(getNextNumber()) % partitions]++;
}

for(int c = 0; c< buckets.length; c++) {
System.out.println("size of bucket " + c + ": " + buckets[c]);
}

}

/**
* Using a distribution of numbers divided evenly over
* even numbers.
*/
private static Integer getNextNumber() {
return new Integer(random.nextInt(Integer.MAX_VALUE/2) * 2);
}

/**
* Returning the default hashCode() implementation.
*/
public static int hash(Integer i) {
return i.hashCode();
}
}
jmate@jmate-Satellite-C650D:~/Desktop/hashcode-issue$java HashCodeIssue 8 890453985 size of bucket 0: 24781 size of bucket 1: 0 size of bucket 2: 24959 size of bucket 3: 0 size of bucket 4: 25163 size of bucket 5: 0 size of bucket 6: 25097 size of bucket 7: 0 jmate@jmate-Satellite-C650D:~/Desktop/hashcode-issue$ java HashCodeIssue 8 234985345
size of bucket 0: 25250
size of bucket 1: 0
size of bucket 2: 24735
size of bucket 3: 0
size of bucket 4: 24866
size of bucket 5: 0
size of bucket 6: 25149
size of bucket 7: 0


Root Cause

Lets take a look at the source code of Integer.hashCode() to find out why the numbers are not being evenly distributed, despite using a hash.

java/lang/Integer.java

/**
* Returns a hash code for this {@code Integer}.
*
* @return  a hash code value for this object, equal to the
*          primitive {@code int} value represented by this
*          {@code Integer} object.
*/
public int hashCode() {
return value;
}


java/lang/Long.java

/**
* Returns a hash code for this {@code Long}. The result is
* the exclusive OR of the two halves of the primitive
* {@code long} value held by this {@code Long}
* object. That is, the hashcode is the value of the expression:
*
* <blockquote>
*  {@code (int)(this.longValue()^(this.longValue()>>>32))}
* </blockquote>
*
* @return  a hash code value for this object.
*/
public int hashCode() {
return (int)(value ^ (value >>> 32));
}


These hash codes are implemented for speed, as a result they don’t do a good job of jumbling up the numbers. Unfortunately, for developers who want partitions of uniform sizes, this is not a good thing. The issue is that all the values share a common divisor with the the number of partitions (4*2 and y=x*2). As a result, all the values end up in the same buckets.

Solution

There are a couple of solutions to this problem.

1. Adjust The Number Of Partitions

This seems unintuitive at first, but REDUCING the number of partitions to 7 will flatten out the sizes of our partitions because there are a lot less even numbers that share a common divisor with the prime number seven.

jmate@jmate-Satellite-C650D:~/Desktop/hashcode-issue$java HashCodeIssue 7 890453985 size of bucket 0: 14153 size of bucket 1: 14335 size of bucket 2: 14377 size of bucket 3: 14265 size of bucket 4: 14364 size of bucket 5: 14179 size of bucket 6: 14327 jmate@jmate-Satellite-C650D:~/Desktop/hashcode-issue$ java HashCodeIssue 7 234985345
size of bucket 0: 14193
size of bucket 1: 14256
size of bucket 2: 14374
size of bucket 3: 14362
size of bucket 4: 14267
size of bucket 5: 14321
size of bucket 6: 14227


2. Use A Better Hash

Now a lot of times you have no idea what distribution of numbers you have but you still want to be able to distribution the numbers uniformly over your partitions. Or, you need a particular number of partitions that will have a common divisor with a lot of numbers coming in. A safe way to do this is to use a hash function that jumble up the numbers better. In the code above, instead of using the default hash implementation, I use the md5 hash function.

/**
* Hash function that uses MD5.
*/
private static int hash(Integer i, int partitions)
throws java.security.NoSuchAlgorithmException {
MessageDigest m = MessageDigest.getInstance("MD5");
m.reset();
m.update(toBytes(i));
byte[] digest = m.digest();
BigInteger bigInt = new BigInteger(1,digest);
return bigInt.mod(new BigInteger(1,toBytes(partitions))).abs().intValue();
}

private static byte[] toBytes(Integer ii) {
int i = ii.intValue();
byte[] result = new byte[4];

result[0] = (byte) (i >> 24);
result[1] = (byte) (i >> 16);
result[2] = (byte) (i >> 8);
result[3] = (byte) (i /*>> 0*/);

return result;
}
jmate@jmate-Satellite-C650D:~/Desktop/hashcode-issue$java HashCodeIssue 8 890453985 size of bucket 0: 12629 size of bucket 1: 12510 size of bucket 2: 12550 size of bucket 3: 12470 size of bucket 4: 12689 size of bucket 5: 12339 size of bucket 6: 12311 size of bucket 7: 12502 jmate@jmate-Satellite-C650D:~/Desktop/hashcode-issue$ java HashCodeIssue 8 234985345
size of bucket 0: 12472
size of bucket 1: 12533
size of bucket 2: 12632
size of bucket 3: 12634
size of bucket 4: 12489
size of bucket 5: 12352
size of bucket 6: 12616
size of bucket 7: 12272


Conclusion

If your performance depends on the size of your partitions, you should take special care when using Integer.hashCode() or Long.hashCode(). By carefully selecting the number of partitions, you can avoid the problem. If you uncertain about the distribution of numbers you have or you cannot change the number of partitions, you should use a hash function that does a better job of jumbling up the numbers.

Now I realize this example was really contrived because the numbers being partitioned were even. I chose it because it’s really simple to understand and illustrate that  if you don’t use a good hash function or have the correct number of partitions, this problem is going bite you when your numbers have a special distribution. The last time I saw this problem in the real world is when the numbers we were partitioning on usually had a particular end like XXXXX7831. If you have encountered this problem, post a comment! I am interesting in hearing about it.

High Fructose Corn Syrup vs. Refined Cane Sugar Coca Cola

Posted in Fun by Joseph on March 12, 2012

Motivation

Recently my roommate and I got into an argument over being able to taste the difference between high fructose and cane sugar Cola. I hypothesized that you could taste the difference between the two because of the level of fructose. My roommate disagreed and claimed that fructose would taste similar to the cane sugar and that we would not be able to tell the difference. We decided to settle the dispute by performing an experiment.

Fructose vs Sucrose Experiment

Experiment

The items used in the experiment are High Fructose Coca-Cola, Cane Sugar Mexican Coca-Cola, and tiny slips of paper to put under the cups to secretly indicate what the cup contained. Person A would fill two cups for each type of cola and mark a piece of paper to indicate what was the contents of the cup and hide it underneath the cup. After finishing person A would leave the room and person B would enter the room. Person B would then shuffle the cups without looking at the pieces of paper and call in person A when he finished. Finally, person A had to drinking each cup and write down what was inside each. After writing down all the guesses we compared the guesses to the actual contents. Before repeating the experiment we washed out all cups with water.

Results

First two sets of trials, both my roommate and I were guessing at about chance. However, on the third trial I noticed that one cola was more bubbly than the other. After identifying that it was the high fructose one, I was able to get the answer 100% of the time. My roommate, also following the observation was able to identify the contents 100% of the time. As a result, we called off the experiment because we were able to identify the cola without attributing it to the presence of high fructose corn syrup. We saved the remaining precious cola for consumption during dinner.

Next Steps

Next experiment we will try to use just pure high fructose water and cane sugar water so that we can control the level of bubbliness. Secondly, we need to cover the cups, or have person A and B blind folded, as it might be possible to identify the contents based on the colour of the liquid. Third, instead of doing the two cups for each type, to make the experiment simpler and more rigorous, we should just fill two cups with three cases: high fructose – high fructose, cane sugar – cane sugar, and high fructose – cane sugar. With the two cups for each type, by guessing the first three cups you knew what the last cup was rendering the last guess giving us no additional information.

The consumerist has an article by Chris Morran about the difference in taste Coca Cola: We Don’t Need To Make A Cane Sugar Version Because You Already Have Mexican Coke. They interviewed Greg Galvez, vice president and general manager of Importation and Commercialization, Coca-Cola North America. He says, “our research shows that there is no perceptible taste difference between the products. Whether sweetened with high fructose corn syrup or sugar, a Coke is a Coke and both are ‘the real thing.” His wording is slightly ambiguous. Is he implying that people cannot taste the difference between the Mexican Coca-cola and the regular Coca-cola or is he saying that if we control all other ingredients then we cannot taste the difference between HFCS and cane sugar?

The Huffington Post has already performed this experiment documented in the article Coca-Cola Taste Test: High Fructose Corn Syrup vs. Sugar. They concluded that 85% of participants could tell the difference and that 80% preferred the taste of the Mexican Cola. None of the articles I found tried to control for the differences in the non sugar ingredients.

Is Mexican Coke the real thing? By Louise Chu

Mexican Coke a hit in U.S. By Kevin Pang

Unintuitive Scoping and Javascript Closures

Posted in Javascript by Joseph on March 4, 2012

I had this problem at work, and have simplified it and removed all context related to the project so this article can be easily consumed. The problem arises when you try to create a closure onto a local variable initialized in a for loop.

I will begin with an example that matches my scoping intuition, which does not use a closure:

function setupArray() {
var array = new Array();
var i=0;
for (i=0;i<=4;i++) {
var tmp;
tmp = i + 1;
array[i] = tmp;
}
return array;
}

function outputArray() {
var array = setupArray();
for (i=0;i<=4;i++) {
document.write(array[i] + "<br/>");
}
}

outputArray();


With this example, we get the output that matches our intuition:

1
2
3
4
5


Now when we modify the code to use a closure instead and we run into problems.

function setupArray() {
var array = new Array();
var i=0;
for (i=0;i<=4;i++) {
var tmp;
tmp = i + 1;
array[i] = function() {
return tmp;
}
}
return array;
}

function outputArray() {
var array = setupArray();
for (i=0;i<=5;i++) {
document.write(array[i]() + "<br/>");
}
}
outputArray();


Here is the output that I was not expecting.

5
5
5
5
5


All the anonymous functions bind to the same variable reference. It is not creating a new variable for each iteration of the loop. My solution was to just place the contents of the loop into a function.

function setupArrayEntry(i) {
var tmp;
tmp = i + 1;
return function() {
return tmp;
}
}

function setupArray() {
var array = new Array();
var i=0;
for (i=0;i<=4;i++) {
array[i] = setupArrayEntry(i);
}
return array;
}


This fixes the problem.

1
2
3
4
5


After posting this I did some additional research and discovered that this seems to be a common problem that developers encounter. For example, this post extremely valuable in understanding what is going on. The author also provides a similar solution. Check it out!

Tagged with: , , ,

Android Appendable List – Github

Posted in Android by Joseph on January 14, 2012

Hey everyone,

Due to some people having issues getting the code to run, I have posted the complete example code on github. Just download the code and open it up in eclipse.

https://github.com/josephmate/AndroidAppendableList
If you don’t know how to use git, you can download a zip file of the sourcecode above, or run this command:
git clone git://github.com/josephmate/AndroidAppendableList.git

Cheers,
Joseph