From Sea to Trees, Pivotal Data Science Looks at Climate Change in Acadia National Park: Day 3 Field Report

November 20, 2014 Srivatsan Ramanujam

With days one and two complete, we are on the third day of the Earthwatch expedition to Acadia National Park. Our team, comprising employees from the EMC Federation, Earthwatch, the Schoodic Institute, and scientists at Acadia National Park, has shared a great deal of climate change and phenology knowledge both in the field and in the auditorium.

In today’s report, I’ll write about our field trip to Mount Desert Island. We will also discuss two problems that our scientists and data managers at Acadia National Park face in their research, and I’ll present my thoughts on how Pivotal’s data science methods, technology, and tools can help them tackle these problems. I continue to see how our knowledge, skills, and technology can complement each other in our effort towards understanding the impact of climate change.

Current Sensors, Data Processing, Models, and Visualization Tools

After breakfast, we headed out to Mount Desert Island, the main part of the park, which also includes the park headquarters. On our drive there, we stopped at several scenic spots, including Sand Beach, Thunder Hole, and the historic Carriage Roads, financed and directed by John D. Rockefeller Jr. in the 1930s. At the park headquarters, we were given a tour of the Archive Room, which contains a wealth of information in the form of historic notebooks, photographs, plant and animal specimens, and cultural artifacts associated with the park. The park management is in the process of digitizing these important archives and making them available on digital repositories like Dryad and DataONE.

After lunch, we headed out to a weather station near Cadillac Mountain. There, we checked out the array of sensors measuring air quality, wind speed and direction, precipitation, mercury levels, and temperature in this part of the park. We also saw the data logging system, which records the data and sends it off to NOAA and the EPA. Additionally, some samples collected at this weather station are mailed out to processing plants as far away as Oregon.

Throughout the day, I spoke with Dr. Richard Feldman, an Earthwatch scientist, and Adam Kozlowski, a data manager with the National Park Service (NPS) at the Northeast Temperate Network (NETN). We discussed the nature of the data sets they use, along with the computational and access problems with their current system.

Richard is modeling duck dynamics in the Prairie Pothole region of the United States and Canada. A fundamental question in ecology is how changes in the environment affect the population of a species. This is a challenging question because, even when the environment is static, the population of a species can fluctuate depending on its density, which affects survival and reproduction rates. Another complexity in modeling duck dynamics is the presence of observer error, which is quite common in manual data collection, as we discussed in our blog yesterday.

In essence, Richard is trying to measure the effect of environmental factors on duck abundance at different sites, given the abundance at the same site the previous year. Using a dataset of duck abundances from over 1000 sites, measured over 50 years, Dr. Feldman fits Structural Equation Models (SEM), and the posterior distribution of the model parameters is estimated using Markov Chain Monte Carlo (MCMC) sampling in tools like OpenBUGS. The process is repeated for 10 different duck species. This is both a data-parallel problem and an embarrassingly parallel one, since the model for each species can be fit independently, and our tools and technology can solve it at scale.
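To make the "repeat the fit for every species" pattern concrete, here is a minimal sketch in Python. It is not Richard's model: I swap his SEM and OpenBUGS for a plain random-walk Metropolis sampler (a simple MCMC method) on a toy density-dependent growth model, and the data, species names, effect sizes, and the assumption of a known noise level are all made up for illustration.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

SIGMA = 0.2                   # assumed known observation noise (a simplification)
TRUE_B, TRUE_C = -0.3, 0.25   # made-up density and environment effects


def simulate_species(seed, n_sites=10, n_years=20):
    """Synthetic (log N[t-1], environment[t], growth[t]) rows for one species."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_sites):
        log_n = rng.gauss(3.0, 0.5)
        for _ in range(n_years):
            env = rng.gauss(0.0, 1.0)
            growth = 1.0 + TRUE_B * log_n + TRUE_C * env + rng.gauss(0.0, SIGMA)
            rows.append((log_n, env, growth))
            log_n += growth
    return rows


def center(rows):
    """Subtract column means so the intercept drops out of the model."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [tuple(v - m for v, m in zip(row, means)) for row in rows]


def log_posterior(b, c, rows):
    """Gaussian log-likelihood with flat priors on b and c."""
    sse = sum((g - b * y - c * x) ** 2 for y, x, g in rows)
    return -sse / (2.0 * SIGMA ** 2)


def fit_species(task, n_iter=3000, burn_in=1000, step=0.02):
    """Random-walk Metropolis; returns posterior means of b and c."""
    name, seed = task
    rows = center(simulate_species(seed))
    rng = random.Random(seed + 1000)
    b = c = 0.0
    lp = log_posterior(b, c, rows)
    draws = []
    for i in range(n_iter):
        b_new = b + rng.gauss(0.0, step)
        c_new = c + rng.gauss(0.0, step)
        lp_new = log_posterior(b_new, c_new, rows)
        # Always accept uphill moves; accept downhill moves with
        # probability exp(lp_new - lp).
        if lp_new >= lp or rng.random() < math.exp(lp_new - lp):
            b, c, lp = b_new, c_new, lp_new
        if i >= burn_in:
            draws.append((b, c))
    b_mean = sum(d[0] for d in draws) / len(draws)
    c_mean = sum(d[1] for d in draws) / len(draws)
    return name, (b_mean, c_mean)


# Each species is an independent task, so the fits can be farmed out.
tasks = [("species_a", 1), ("species_b", 2), ("species_c", 3)]
with ThreadPoolExecutor() as pool:
    estimates = dict(pool.map(fit_species, tasks))

for name, (b_hat, c_hat) in sorted(estimates.items()):
    print(f"{name}: density effect {b_hat:.2f}, environment effect {c_hat:.2f}")
```

The thread pool only shows the shape of the work. In CPython, threads will not speed up this CPU-bound loop; the gain comes when the same per-species function is handed to separate processes or to the segments of an MPP database.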

For example, we’ve implemented Bayesian Hierarchical Regression in the context of demand modeling, predicting demand for consumer goods as a function of meaningful explanatory levers such as pricing, product and geographical attributes, and weather. These models were estimated using an MCMC algorithm named Gibbs Sampling, leveraging tools such as Procedural Language R (PL/R) and MADlib. For more information about how our data scientists at Pivotal solve problems at scale, please refer to Pivotal Data Labs: Technology and Tools in our Data Scientist’s Arsenal, or contact us directly.
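For readers who have not seen Gibbs Sampling before, the sketch below shows the idea on a toy hierarchical regression in Python. It is not our production demand model, and it does not use PL/R or MADlib. The regions, the prices, the true effects, and the assumption that both variance terms are known are hypothetical choices made to keep the example short. Each region gets its own price effect, and those effects are tied together through a shared mean.

```python
import random

rng = random.Random(42)
SIGMA = 0.5  # assumed known noise in demand
TAU = 0.3    # assumed known spread of price effects across regions

# Made-up regional price effects; their average is -1.5.
TRUE_SLOPES = {"north": -1.9, "south": -1.7, "east": -1.5,
               "west": -1.3, "central": -1.1}


def simulate(slope, n=30):
    """Synthetic (price, demand) pairs; demand is a deviation from a
    baseline, so the toy model has no intercept."""
    rows = []
    for _ in range(n):
        price = rng.uniform(1.0, 3.0)
        rows.append((price, slope * price + rng.gauss(0.0, SIGMA)))
    return rows


# Per-region sufficient statistics: sum of x*x and sum of x*y.
suff = {}
for region, slope in TRUE_SLOPES.items():
    rows = simulate(slope)
    suff[region] = (sum(x * x for x, _ in rows),
                    sum(x * y for x, y in rows))


def gibbs(suff, n_iter=3000, burn_in=500):
    """Alternate draws of each regional slope and of the shared mean."""
    mu = 0.0
    beta = {region: 0.0 for region in suff}
    mu_sum = 0.0
    beta_sums = {region: 0.0 for region in suff}
    for i in range(n_iter):
        # Regional slope given the shared mean: a normal whose precision
        # adds the data precision and the prior precision.
        for region, (sxx, sxy) in suff.items():
            precision = sxx / SIGMA ** 2 + 1.0 / TAU ** 2
            mean = (sxy / SIGMA ** 2 + mu / TAU ** 2) / precision
            beta[region] = rng.gauss(mean, precision ** -0.5)
        # Shared mean given the slopes (flat prior on mu).
        mu = rng.gauss(sum(beta.values()) / len(beta),
                       TAU / len(beta) ** 0.5)
        if i >= burn_in:
            mu_sum += mu
            for region in beta:
                beta_sums[region] += beta[region]
    kept = n_iter - burn_in
    return mu_sum / kept, {r: s / kept for r, s in beta_sums.items()}


mu_hat, beta_hat = gibbs(suff)
print(f"shared price effect: {mu_hat:.2f}")
for region in TRUE_SLOPES:
    print(f"{region}: {beta_hat[region]:.2f}")
```

Because every conditional distribution here is a normal, each step is a direct draw with no accept or reject decision, which is what distinguishes Gibbs Sampling from a general Metropolis sampler.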

Scaling Statistical Models and Visualizations

We also had a chance to ask Adam Kozlowski about how the Northeast Temperate Network (NETN) uses its data. One of the goals of NETN, as described in their mission objectives, is to detect changes in the physical, chemical, or biological properties of freshwater resources within NETN parks. This includes changes which cannot be explained by natural variability alone. During our conversation over dinner, we learned that Adam and his team are currently engaged in analyzing the trends in aquatic resources in about a dozen parks, including Acadia, and presenting them through interactive visualizations, developed using RShiny, to other researchers and employees within the NPS.

To do this, Adam’s team pulls data out of IRMA (Integrated Resource Management Applications), which has a large collection of documents, datasets, and publications related to NPS natural and cultural resources. The data that Adam’s team retrieves from IRMA is stored in a SQL database on a Linux server and is then queried and presented as visualizations using RShiny. At Pivotal Data Labs, a lot of our data scientists use and love RShiny. While R is a great tool for statisticians to build models and visualize data, it would benefit from a platform that lets it scale to big data. Pivotal’s open source product, PivotalR, is one of the favorite tools amongst our customers who are R users. It allows users of R to interact with massively parallel processing (MPP) platforms such as the Pivotal Greenplum Database or Pivotal HAWQ. With this approach, people like Adam and his team can continue to use their existing skills, while applying their knowledge to much larger datasets, more complex models, and more advanced visualization tasks.
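The idea behind this approach is to let the database do the heavy computation and hand only small results back to the analysis or visualization layer. PivotalR does this from R; the sketch below only illustrates the principle in Python, with the standard library's sqlite3 standing in for an MPP database. The table name, column names, park labels, and pH readings are all synthetic and hypothetical, not NETN data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE water_samples (park TEXT, year INTEGER, ph REAL)")

# Synthetic readings: park_a drifts down by 0.02 pH units per year,
# park_b drifts up by 0.01 per year.
rows = []
for year in range(2006, 2015):
    t = year - 2006
    rows.append(("park_a", year, 6.50 - 0.02 * t))
    rows.append(("park_b", year, 7.10 + 0.01 * t))
conn.executemany("INSERT INTO water_samples VALUES (?, ?, ?)", rows)

# Least-squares slope of pH against year, computed per park from SQL
# aggregates, so only one row per park leaves the database.
TREND_SQL = """
SELECT park,
       (COUNT(*) * SUM(year * ph) - SUM(year) * SUM(ph)) /
       (COUNT(*) * SUM(year * year) - SUM(year) * SUM(year)) AS slope
FROM water_samples
GROUP BY park
ORDER BY park
"""
trends = dict(conn.execute(TREND_SQL).fetchall())

for park, slope in trends.items():
    print(f"{park}: {slope:+.3f} pH units per year")
```

A dashboard would then plot the handful of per-park slopes rather than pulling every raw sample into memory, and the same query shape carries over as the underlying table grows.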

Creating Citizen Scientists as a Next Step

We capped the day by watching the 2012 documentary film Chasing Ice, in which photographer James Balog shows timelapses of receding glaciers in Greenland, Iceland, and Alaska. The haunting yet beautiful imagery of the giant glaciers receding ever faster made the mood quite grim in our auditorium. Addressing climate change is an important endeavor for all the people of the world.

In one of our brainstorming sessions yesterday, Earthwatch asked us to put on the hats of scientists, educators, resource managers, and corporate employees in turn, and answer two questions: what could we expect to get out of citizen scientists, and how could we make their experience rewarding as well? One response common to all of these groups was to give interested citizen scientists the opportunity to participate in building models that study the relationships between the stressors and the dependent variable of interest.

Tomorrow, we will split into groups and discuss how a climate data lake, built by the EMC Federation and working in conjunction with Earthwatch, the Schoodic Research Institute and the researchers at Acadia National Park, can help us study climate change and take citizen science a step further, engaging people to contribute.
