Is global warming real ? — Data Analysis
Hello! This is the first post in a series that will run the same script in different configurations: sequential execution, parallel with MapReduce, and parallel with Spark.

This series is inspired by the book Hadoop: The Definitive Guide (4th edition), and the data is collected from NCDC. The data is a bit messy, but our goal here is simply to find the maximum temperature for each year, so we'll focus our efforts on extracting the temperature.

The data is organised in files, and each file represents one year of observations. Basically, there are stations all over the world, and each record holds the station's coordinates, the observation date and time, and sensor readings such as wind direction, air temperature, atmospheric pressure...

Before getting into the scripts' details and output, let's have a look at the data.

0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999

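This is a single record from the data file for the year 1901, and we can see that it is clearly not very readable. The year is the 1901 at characters 16 to 19, and the temperature is the field -0078, stored in tenths of a degree, which reads as -7.8 °C (the 1 right after it is a quality code).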
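To make the record layout more concrete, here is a minimal Python sketch of how the year and temperature could be pulled out of such a line. It is not the code from the repo: `parse_record` is a hypothetical helper, and the offsets, the 9999 "missing" sentinel and the accepted quality codes are assumptions based on the fixed-width layout described in the Hadoop book.

```python
# Hypothetical helper, not the repo's code. Offsets assume the fixed-width
# NCDC layout as described in "Hadoop: The Definitive Guide".
MISSING = 9999          # assumed sentinel for a missing temperature
GOOD_QUALITY = "01459"  # assumed set of acceptable quality codes


def parse_record(line):
    """Return (year, temperature in degrees Celsius), or None if unusable."""
    if len(line) < 93:
        return None
    year = line[15:19]
    temp_tenths = int(line[87:92])  # signed value in tenths of a degree
    quality = line[92]              # single-character quality code
    if temp_tenths == MISSING or quality not in GOOD_QUALITY:
        return None
    return year, temp_tenths / 10.0


# The sample record shown above, joined into one string.
record = ("0029029070999991901010106004+64333+023450FM-12+000599999V0202701N"
          "015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999")
print(parse_record(record))  # ('1901', -7.8)
```

Slicing by position works here because every field has a fixed width, so no delimiter parsing is needed.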



I won't go through the code because it's pretty straightforward, but feel free to drop a comment or send an email from the contact page!

The total amount of data crunched was about 11 GB, and it took 27 minutes on an i5 computer with 6 GB of RAM... That was a bit long!

Check out the code on GitHub!

Clone the GitHub repo and run the scripts using the following command (this simulates a MapReduce job):


cat input/* | python mapper.py | sort | python reducer.py
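For readers who want to see what such a pipeline does without cloning anything, here is a rough single-file sketch of the same map, sort, reduce flow. It is not the repo's `mapper.py` or `reducer.py`; the function names, the field offsets and the `fake_record` helper (which builds synthetic lines, not real NCDC data) are all assumptions made for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit 'year<TAB>temperature' for every usable record."""
    for line in lines:
        if len(line) < 93:
            continue
        temp = int(line[87:92])  # tenths of a degree Celsius (assumed offset)
        if temp != 9999 and line[92] in "01459":
            yield f"{line[15:19]}\t{temp}"


def reducer(sorted_pairs):
    """Reduce phase: keep the maximum per year; input must be sorted by year."""
    split = (pair.split("\t") for pair in sorted_pairs)
    for year, group in groupby(split, key=lambda kv: kv[0]):
        yield year, max(int(temp) for _, temp in group)


def run(lines):
    """Simulate `mapper | sort | reducer` in a single process."""
    return list(reducer(sorted(mapper(lines))))


def fake_record(year, temp_tenths):
    """Build a synthetic fixed-width line (illustration only, not real data)."""
    return "0" * 15 + year + "0" * 68 + f"{temp_tenths:+05d}" + "1"


sample = [
    fake_record("1901", -78),
    fake_record("1901", 317),
    fake_record("1902", 244),
    fake_record("1902", 9999),  # treated as a missing reading and skipped
]
print(run(sample))  # [('1901', 317), ('1902', 244)]
```

The sort step matters: the reducer only compares neighbouring keys, so all records for a given year must arrive together, which is exactly what `sort` does in the shell pipeline. The maxima stay in tenths of a degree, so 317 means 31.7 °C.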


I've placed only a handful of files in the input/ directory of my GitHub repo. For more data, download it from NCDC's website or use this script, which expects a starting and an ending year:

./ncdc.sh 2000 2015



The trend looks like the following, and you can clearly see that maximum temperatures rise year after year from 1901 to 1950.


Now, this is not enough data to say whether global warming is real or not. First of all we need up-to-date data (I'm trying to get my hands on that... it takes time, you know), and we also need to cross this data with events occurring worldwide (natural catastrophes and so on).

So the answer is complicated, but from the data we can clearly see a trend!
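One simple way to put a number on such a trend is to fit a straight line through the yearly maxima and look at its slope. The sketch below does that with plain least squares; `linear_trend` is a hypothetical helper, and the values in `toy` are placeholders made up for the example, not results from the NCDC data.

```python
def linear_trend(points):
    """Least-squares slope of y against x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var


# Placeholder (year, max temperature in degrees Celsius) pairs, NOT real results.
toy = [(1901, 30.0), (1902, 30.5), (1903, 31.0)]
print(linear_trend(toy))  # 0.5, i.e. half a degree per year for this toy input
```

A positive slope only says the maxima go up on average over the period; it does not say why, which is where the cross-checking with other data comes in.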
