It’s our final day of work and fun at Acadia National Park. We arrived on Monday, went through orientation, and learned about the research problems of interest in relation to climate change. We spent Tuesday and Wednesday as citizen scientists, collecting data on bird migrations and on barnacle abundance in intertidal regions, and learning about the challenges in that process. We also talked about how data science and a climate data lake could aid the scientists, educators, and resource managers in measuring, analyzing, and predicting the impact of climate change on the plant and animal species in the park. Today, we fleshed out a plan to make those ideas more concrete.
We posed, and formulated responses to, questions like: What problems do the scientists at Acadia and Schoodic Institute seek to answer through the climate data lake? Which data sources could we ingest into this data lake? What models can be built to answer questions with these data sources? What visualizations of the data and the models can we show to tell the story of climate change and its impact in the park? Who will consume our models and visualizations? Where will we host them?
Designing the Data Lake Solution
One of the problems of primary interest to the scientists at Acadia is measuring and predicting the effect of climate change on hawk migration. We chose this as our candidate for a pilot project because we believe that gathering all relevant data into a climate data lake and building models and visualizations to answer this question could serve as a template for answers to other important questions. The business use case for this question from Acadia and Schoodic’s point of view is to make the data available to the National Park Service (NPS) and consumable by not only scientists, but educators and citizen scientists as well.
One way of achieving this is through a web portal showing interactive visualizations of hawk migrations that capture the within-year and inter-year variability and their dependence on climate factors. People accessing this web portal would be able to drill down into a region of interest, observe hawk migrations in that region, and see the projected migrations given our ever-changing climate conditions. In our initial assessment, Pivotal Cloud Foundry could be a candidate for hosting the web portal. We believe open source visualization tools like D3 can be used for telling our data science story.
The portal will be served by a climate data lake which will ingest data from multiple sources. The EMC Big Data solution powered by the Pivotal Big Data Suite, including Apache Hadoop® with HAWQ, is a good candidate to build the data lake.
The data sources we’ll be ingesting into the climate data lake will include Hawkwatch and eBird for bird migration data, the National Climatic Data Center (NCDC) or the British Atmospheric Data Center (BADC) for weather-related data, and iNaturalist for data related to plant and animal observations (e.g., food for hawks).
Our extract, transform, and load (ETL) pipeline would include operations such as standardizing these data sources, converting them to a common frequency, and imputing missing values, among other tasks. Once the data is loaded into the data lake, we can join relevant tables and generate features of interest for our modeling and visualizations. For example, we could build a regression model to predict the time of arrival of a certain species at a given site in a given year, given climate factors such as temperature, precipitation, hours of daylight, wind speed vector, etc.
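To make the ETL steps above a little more concrete, here is a minimal sketch in plain Python. It is not our actual pipeline: the function names (`impute_linear`, `to_weekly_mean`, `join_on_week`) are hypothetical, the readings are made-up toy values, and we assume linear interpolation for imputation and a weekly frequency purely for illustration. In practice these steps would likely run inside the data lake itself.

```python
from collections import defaultdict
from datetime import date, timedelta


def impute_linear(values):
    """Fill None gaps by linear interpolation; gaps at the edges copy the nearest reading."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to impute from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def to_weekly_mean(daily):
    """Average {date: value} readings into {week-start Monday: mean}."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        monday = day - timedelta(days=day.weekday())
        buckets[monday].append(value)
    return {week: sum(vals) / len(vals) for week, vals in sorted(buckets.items())}


def join_on_week(counts, weather):
    """Inner-join two {week: value} series into {week: (count, weather)}."""
    return {week: (counts[week], weather[week]) for week in counts if week in weather}


# Toy, made-up inputs: daily temperatures (deg C) with gaps, and weekly hawk counts.
start = date(2015, 9, 7)  # a Monday
raw_temps = [18.0, None, 16.0, 15.0, None, None, 12.0,
             11.0, 10.0, None, 9.0, 8.0, 8.0, 7.0]
days = [start + timedelta(days=i) for i in range(len(raw_temps))]

weekly_temp = to_weekly_mean(dict(zip(days, impute_linear(raw_temps))))
weekly_hawks = {start: 120, start + timedelta(days=7): 340}

features = join_on_week(weekly_hawks, weekly_temp)
for week, (hawks, temp) in features.items():
    print(week, hawks, round(temp, 2))
```

The design choice worth noting is the order of operations: imputing at the finer (daily) frequency before aggregating keeps a single missing reading from skewing a weekly average.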
Such models could be built at scale using open source libraries such as MADlib or the many libraries available in Python and R ecosystems through PL/Python or PL/R. We could use the output of the model to show hawk migration predictions, which can be consumed by visitors to the portal. We could also build visualizations like decomposition reports, which could help citizen scientists understand the impact of various climate levers on hawk migration times.
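As a rough illustration of the kind of regression and decomposition described above, here is a small, self-contained sketch in plain Python. It is a stand-in, not the MADlib or PL/Python workflow we would actually use at scale. The helper names (`fit_ols`, `predict`, `contributions`) are hypothetical, and the data is synthetic: arrival days are generated from a formula we made up (260 - 2 × temperature + 0.5 × wind speed), so nothing here reflects real hawk behavior.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(features, targets):
    """Ordinary least squares via the normal equations; returns [intercept, coef_1, ...]."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, feature_row):
    """Predicted arrival day of year for one (temperature, wind speed) row."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], feature_row))


def contributions(coefs, feature_row, baseline_row):
    """Per-feature shift in the prediction relative to a baseline row."""
    return [c * (x - b) for c, x, b in zip(coefs[1:], feature_row, baseline_row)]


# Synthetic site-years: (mean temperature in deg C, mean wind speed in m/s).
site_years = [(10.0, 5.0), (12.0, 3.0), (14.0, 8.0), (16.0, 2.0), (18.0, 6.0)]
# Synthetic target from the made-up formula above; no noise, for illustration only.
arrival_day = [260.0 - 2.0 * t + 0.5 * w for t, w in site_years]

coefs = fit_ols(site_years, arrival_day)
baseline = [sum(col) / len(col) for col in zip(*site_years)]
scenario = (15.0, 4.0)

print("coefficients:", [round(c, 3) for c in coefs])
print("predicted arrival day:", round(predict(coefs, scenario), 1))
print("shift vs. average year:", [round(d, 2) for d in contributions(coefs, scenario, baseline)])
```

The `contributions` helper is one simple reading of what a decomposition report could show: how much each climate factor moves the predicted arrival day away from an average year. With a linear model this split is exact; other model types would need a different attribution method.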
The climate data lake, serving the visualization portal, will give citizen scientists an important tool for understanding the data about climate change. Our team from the EMC Federation, Acadia, Schoodic and Earthwatch believes that this initiative will encourage more volunteers and citizen scientists to participate in combating climate change, become advocates for it, and recognize the need to act soon.
In our previous posts, we talked about how technology, data science and automation could help researchers focus on research while offloading their burden of data collection. We also spoke about making this endeavor rewarding to citizen scientists by helping them participate in such a cause, be it through visualizing the data and building models or in assisting scientists to collect data.
In our conversations over lunch today, Hannah, our field team leader and education projects manager for Schoodic Institute, pointed out that automated data collection would help them expand their citizen science program to those who can’t physically travel to the park. For example, websites like www.planktonportal.org contain high-resolution images of microscopic organisms in water, taken and uploaded by researchers and volunteers. The site also provides a tutorial for citizen scientists to learn how to identify and tag different microorganisms. Once citizen scientists have worked through the tutorial and are acquainted with the process, they can help by manually labeling images with the microorganisms they contain.
This enables two key things. One, anyone in the world who has a passion for a cause, such as conservation or climate change, could contribute towards it. Two, it could assist scientists and ecologists in their data collection efforts when they have resource and budget constraints.
Wrapping Up the 4-Day Expedition and Thank You
We wrapped up our productive and rewarding week at Acadia with Maine lobsters for dinner (or salads for the vegetarians like yours truly) and a parting fistbump! From here we’ll be taking our plan to our respective companies in the EMC Federation and charting out next steps for the climate data lake. We’d like to thank Acadia National Park, the Schoodic Institute and Earthwatch for giving us the wonderful opportunity to come here and work together. Our goal has been to understand how the power of a data lake overlaid with data science can arm research scientists and build bridges to a collaborative citizenry to study and understand the effects of climate change.
Thank you for following our updates from the field. I greatly enjoyed this Earthwatch expedition. As someone who loves data and national parks (Acadia is my 22nd national park out of 59), I couldn’t have asked for more. I highly recommend it to others who’d like to partner with Earthwatch on similar causes. Climate change is an important challenge for our generation, and I feel fortunate to have the opportunity to make a minor contribution to this cause. We hope to share more updates in a couple of months.
- Check out the full series with our blogs from Day 1, Day 2 and Day 3