January 7, 2014
APL Staffer Scores Slice of Big Data Prize
This past summer, Bryan Gorman, chief scientist in the Force Projection Sector and part-time math analyst, took up a complex online challenge posed by Vincent Granville, a leading "big data" scientist. Gorman's mathematical approach to finding his answer was unexpected, and while it was only part of the complete solution, it led to a $500 award.
For Gorman, though, the cash was secondary to the chance to stretch his math abilities. "Once I got into the problem, the money didn't matter," he says. "This type of problem is not exactly in my comfort zone, but I like to challenge myself."
Gorman's finding provided the first step toward creating a different method of calculating a new statistic invented by Granville. The statistic is a more robust type of mathematical relationship known as a correlation coefficient, one based on a "footrule" proposed by statistician Charles Spearman in 1906. This alternative method allows two sets of numbers to be mathematically analyzed without being greatly influenced by outliers—numbers at the far upper or lower ends of the sets—and skewing the analysis. Big data is, almost by its nature, filled with outliers, which makes automated analysis very difficult.
What surprised Granville was that Gorman's answer was based in mathematical modeling instead of the expected data-crunching solution. Granville had solved a balance issue within Spearman's footrule with his statistic, but in order to compute the statistic, the greatest possible distance between a permuted sequence of numbers and two ordered sequences (all increasing and all decreasing) had to be found. The problem was there was no formula for that. It couldn't be computed because every possible arrangement of the numbers would have to be searched. Twenty numbers means searching through more than 2.4 billion billion arrangements.
Gorman's solution was to first find the lower bound, the first half of finding the maximum distance. As he looked at the problem and sample sets, "there seemed to be patterns emerging," he says. "I thought I could exploit those, and that led me to the lower boundary."
He sent his results to Granville in July and went on vacation, still thinking about how to obtain the upper bound of the maximum distance. While he was away, another scientist independently solved the entire problem, using a solution similar to Gorman's (whose breakthrough was acknowledged with the cash award). It is now possible to compute Granville's statistic with any sample size.
"This shows that you can [perform analyses of] massive data sets that won't be driven too far off by outliers," Gorman says. The finding makes it easier to perform any big data task, such as improving ranking algorithms such as Google PageRank, or gene sequencing.
"This is a small problem in light of the huge challenges out there," says Gorman. "There are so many problems that have been dropped for want of an efficient strategy to compute results. Now that we have new computational capabilities, we just have to remember we dropped them."
– V. Cikanek