Tuesday, July 7, 2009

Wolfram Alpha Review

Wolfram|Alpha was recently released with great fanfare. Its overarching objective is very broad:

"Wolfram|Alpha is the first step in an ambitious, long-term project to make all systematic knowledge immediately computable by anyone". [http://www.wolframalpha.com/]

This can be compared to Google’s corporate statement:

"To organize the world's information and make it universally accessible and useful." [http://www.google.com/corporate/]

Many of the reviews make side-by-side comparisons of results from different search engines. Even if Wolfram|Alpha succeeds in returning better results for a limited subset of data, it will fail to make a significant fraction of “all systematic knowledge immediately computable” because doing so requires expert humans:

“… as the physicist sat, exhausted, immersed in the minutiae of food science. On the computer screen before him were raw tables of information from the U.S. Department of Agriculture, containing data on 7,000 foods, from blackberries to beef. He and a four-person team were "curating" the data, readying it for a new kind of online search.” [http://www.technologyreview.com/web/22834/]

Remember the old Yahoo, where everything was placed into categories by humans? That was a much simpler problem than the data “curation” described above. In fact, organizing data and preparing it for systematic computation is a significant part of what a researcher does. Every field of science has its own way of organizing data, and it takes specialists to organize and validate it.

The fundamental problem with Wolfram|Alpha is that it requires so much human intervention. What will happen a few years from now when all of the USDA’s data tables are updated or their database format changes? Wolfram|Alpha will need to find and pay a team of experts to update and validate the new data.

My assessment of this fundamental flaw in Wolfram|Alpha’s approach is primarily based on what I know is required to deal with satellite orbit data. You can find satellite orbit data displayed in a comprehensive and clear manner at Wolfram|Alpha [http://www01.wolframalpha.com/input/?i=NOAA+15], which on the surface seems much more usable than what is available elsewhere [Start at http://sscweb.gsfc.nasa.gov/cgi-bin/sscweb/Query.cgi and then work your way through the menu to get equivalent data]. When a new satellite becomes available, who will update Wolfram|Alpha? And why would they be motivated to do it? If I have a detailed technical question about the implementation of an orbit calculation, whom do I ask? If, 5-10 years from now, we see hundreds of small satellites launched per year, who is going to update the Wolfram|Alpha database?

The most basic problem is that if they are going to keep up with every specialized data set from every scientific subspecialty that has computable data, they need a community that will contribute. Wolfram’s approach is not community oriented, however. If I have an article about physics that could fit in http://scienceworld.wolfram.com/ or Wikipedia, I would choose Wikipedia. Many would justify this choice by saying that Wolfram Research is a corporation, and why should I give free contributions to a corporation if I don’t get anything in return? I have a more pragmatic reason: Wolfram Research is a corporation, and corporations come and go. When they go, their intellectual assets tend to go with them. Releasing intellectual assets from a company that is dying or being swallowed requires money, which is not typically plentiful for a company in such a state.

- Bob Weigel

Tuesday, June 30, 2009

What is Computational and Data Science?



Welcome to the blog for the Department of Computational and Data Sciences at George Mason University. In the coming weeks and months, we will be exploring a wide variety of topics related to the research and teaching we do, along with wider issues in our field. We plan to post about one entry per week, written by our faculty, students, alumni, and other guest bloggers.

This blog was designed to help connect our department with a wider community working in the Computational Science and Data Science areas. We hope that our entries will prompt discussions and help promote cross disciplinary collaborations.

For the first entry in our blog, I wanted to talk about the most basic questions - what is Computational and Data Science and why is it important?

Over the last few hundred years, the tools and techniques of science have grown in sophistication. The level of abstraction in our theories and the quality of our data have grown with our ability to transform basic concepts into complex instruments.

A good example of this transformation is medical imaging. Theories created by Maxwell in the 1800s have been combined with basic concepts in atomic physics from the early 1900s to create magnetic resonance imaging. Even with this breakthrough technology, making true 3-dimensional images of the human body was not possible until there was enough computational power to turn the data from the instrument into an image. Using these images, colleagues of mine in our department have created 3-dimensional simulations of blood flow through the human brain for individual patients. By using these simulations, surgeons can make better informed decisions about when to operate to fix cerebral aneurysms. The same technologies, namely MRI scanners combined with computers, are now being used to do experimental economics at Mason to find out how we make economic decisions.

Basic theory was used as a foundation to develop tools when it was combined with computational power and technology. The tools have been used in unexpected ways, as our ability to analyze the data and model it has grown with the increases in computational power.

Across all the sciences, we see computers being used routinely. A theoretical physicist, for example, routinely uses Mathematica or Matlab to create numerical solutions to ODEs and PDEs. At the same time, an experimental physicist uses automatic data acquisition hardware to capture data during an experiment and to analyze the results. We see similar uses of computers across the sciences and engineering, and increasingly in the social sciences. This leads us to an interesting question: if computers are used everywhere, can we really say that Computational Science is something separate from disciplines like physics and biology?

In fact, the sciences borrow ideas and techniques from each other all the time. Scientists across the disciplines talk about "using a mathematical model", "using physics", or "using statistics." However, even though physics, mathematics, and statistics are integrated into other disciplines, they are separate academic fields in their own right. Statisticians don't consider themselves biologists just because a biologist is using statistical tools, nor do biologists consider themselves statisticians because they are using statistics. The same is true of Computational Science.

Just as with mathematics, statistics, and physics, most uses of the Computational and Data Sciences are relatively simple. Numerically solving an ODE, doing simple data analysis tasks, graphing data, or setting up a simple scientific database are all part of our discipline, but they are at the simpler end of the spectrum of the things that CDS scientists do. We still use basic tools like Matlab at times, but we spend more time both using and developing advanced tools to solve more complex problems. At least from my point of view, the difference between using Matlab and developing a parallel code that uses MPI is perhaps the difference between using the tools of Computational Science and being a Computational Scientist. Similarly, the difference between using Excel to analyze data and creating a system that handles tens of terabytes per day is the difference between using the tools of Data Science and being a Data Scientist.
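To make the simpler end of that spectrum concrete, here is a minimal sketch of numerically solving an ODE in plain Python. It is only an illustration: the function names (rk4_step, solve, decay) and the test equation dy/dt = -y are assumptions chosen for this example, not code from any particular course or project. It uses the classical fourth-order Runge-Kutta method and compares the result with the known exact solution.

```python
import math


def rk4_step(f, t, y, h):
    """Advance y by one classical fourth-order Runge-Kutta step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def solve(f, t0, y0, t_end, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t_end using n_steps equal steps."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


def decay(t, y):
    """Hypothetical test problem: dy/dt = -y, with exact solution y0 * exp(-t)."""
    return -y


approx = solve(decay, 0.0, 1.0, 1.0, 100)
exact = math.exp(-1.0)
print("numerical:", approx)
print("exact:    ", exact)
print("abs error:", abs(approx - exact))
```

A script like this is a use of the tools of Computational Science. Running the same kind of calculation for millions of coupled equations across many processors is a different kind of problem.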

A recent report to the President entitled "Computational Science: Ensuring America's Competitiveness" outlines some of the challenges we are facing in this field. This report states:
"Though the information technology-powered revolution is accelerating, this country has not yet awakened to the central role played by computational science and high-end computing in advanced scientific, social science, biomedical, and engineering research; defense and national security; and industrial innovation... While it [Computational Science] is itself a discipline, computational science serves to advance all of science. The most scientifically important and economically promising research frontiers in the 21st century will be conquered by those most skilled with advanced computing technologies and computational science applications."

The principal recommendation of this report was:
"Universities and the Federal government's R&D agencies must make coordinated, fundamental, structural changes that affirm the integral role of computational science in addressing the 21st century's most important problems, which are predominantly multidisciplinary, multi-agency, multi-sector, and collaborative...."

Of course, we in the Department of Computational and Data Sciences couldn't agree more.

-John Wallin