Global warming continued — AWS mapreduce job
Continuing from the previous post, we've seen that sequential execution is not the solution. There is far too much data to process, and it would take forever as we keep adding more yearly observations.

In this post we'll go through the steps of running the previous code on an AWS Hadoop cluster.

The scripts are already fit for a MapReduce job, so we won't change anything. The real change is how to set up and run them in cluster mode.
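If you missed the earlier post, here is a rough sketch of what "fit for a MapReduce job" means with Hadoop streaming: the mapper reads lines and emits tab-separated key/value pairs, Hadoop sorts them by key, and the reducer aggregates each run of identical keys. The `year,temperature` record format and the per-year maximum are assumptions made up for illustration, not the actual scripts from the previous post.

```python
import itertools


def mapper(lines):
    """Emit 'year<TAB>temperature' for every well-formed input line."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # skip malformed records
        year, temp = parts
        yield f"{year}\t{temp}"


def reducer(sorted_lines):
    """Receive mapper output sorted by key and keep the maximum per year."""
    pairs = (line.split("\t") for line in sorted_lines)
    for year, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{year}\t{max(float(temp) for _, temp in group)}"


def run_locally(lines):
    """Mimic `cat input | mapper | sort | reducer` on a single machine."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical sample records, including one malformed line.
sample = ["1950,12.5", "1951,13.1", "1950,14.0", "garbage"]
print(run_locally(sample))
```

On the cluster, Hadoop takes care of the sort and of feeding many mappers and reducers in parallel; the scripts themselves stay the same.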

Steps are going to be as follows:
  1. Create a key pair and download it to your local computer.
  2. Create a role and attach policies to it.
  3. Create an EMR cluster.
  4. Create a streaming step.
  5. Check out the results in S3.

You could go through the AWS tutorials, which walk through this process step by step. Here I'll cover a minimal approach to configuring the cluster; it should be enough to get you started.


Creating a key pair
This key pair is the means by which we connect to our cluster, and you will need it every time you create one. You attach it at creation time, so keep it safe. You can follow this guide: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html

Then download it and keep it somewhere safe, for instance in your home directory.
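If you prefer scripting over clicking, the same thing can be sketched with boto3. This is a minimal sketch, assuming boto3 is installed and your AWS credentials are configured; the key name `emr-demo-key` is a placeholder of mine. The file is made owner-read-only because SSH refuses private keys that other users can read.

```python
import os
import stat


def save_private_key(key_material, path):
    """Write the private key and make it owner-read-only (like chmod 400)."""
    with open(path, "w") as handle:
        handle.write(key_material)
    os.chmod(path, stat.S_IRUSR)
    return path


def create_key_pair(ec2_client, key_name, directory):
    """Ask EC2 for a new key pair and store the private half as <key_name>.pem."""
    response = ec2_client.create_key_pair(KeyName=key_name)
    target = os.path.join(directory, f"{key_name}.pem")
    return save_private_key(response["KeyMaterial"], target)


# With real credentials this would be called roughly like so
# (the key name is a placeholder):
#   import boto3
#   create_key_pair(boto3.client("ec2"), "emr-demo-key", os.path.expanduser("~"))
```

AWS only hands out the private key once, at creation time, so losing the file means creating a new key pair.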

Creating a role with policies
At first this is confusing (IAM, role, policy, etc.), but policies are merely permissions you set for a given user or role. When we create our cluster we will give it a role with pre-set policies, so that we have control over what can be done with this cluster: restricting access to other services, for example.

Next, navigate to IAM (Identity and Access Management) by searching for it in the search bar; once there, go to Roles, as below:


Next, we need to attach a suitable policy, here administrator access:


And finally, give it a name.
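The console steps above can also be sketched in code. This is only a sketch under assumptions: the role name is a placeholder, and I'm assuming the role is meant to be used as an EC2 instance profile, which requires a trust policy allowing EC2 to assume it. The policy attached is the AWS-managed AdministratorAccess policy, mirroring the choice made in this post.

```python
import json

# ARN of the AWS-managed administrator policy used in this post.
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"


def ec2_trust_policy():
    """Trust policy that lets EC2 instances assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def create_instance_profile_role(iam_client, name):
    """Create the role, attach the policy, and wrap it in an instance profile."""
    iam_client.create_role(
        RoleName=name,
        AssumeRolePolicyDocument=json.dumps(ec2_trust_policy()),
    )
    iam_client.attach_role_policy(RoleName=name, PolicyArn=ADMIN_POLICY_ARN)
    iam_client.create_instance_profile(InstanceProfileName=name)
    iam_client.add_role_to_instance_profile(InstanceProfileName=name, RoleName=name)
    return name


# With real credentials (the role name is a placeholder):
#   import boto3
#   create_instance_profile_role(boto3.client("iam"), "emr-demo-role")
```

When you create a role for EC2 in the console, the instance profile is created for you; through the API it is a separate step, hence the last two calls.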



Creating an EMR cluster

This is where we create our 3-node cluster. The process couldn't be easier: you just have to set the EC2 key pair and the EC2 instance profile properly.

The EC2 key pair is the key we set up earlier; it's the one that lets you SSH into your cluster. The EC2 instance profile is the role we created with the administrator policy.

The end result should look something like this.




If you want specific Apache Hadoop ecosystem projects, select the applications that suit your needs (go to the advanced options to pick individual components).
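For reference, here is roughly how the same cluster could be described for boto3's `run_job_flow`. Treat it as a sketch: the instance type, the release label and the `EMR_DefaultRole` service role are assumptions on my side, so swap in whatever your account and region actually offer.

```python
def cluster_config(name, key_name, instance_profile, node_count=3,
                   instance_type="m5.xlarge", release_label="emr-6.15.0",
                   service_role="EMR_DefaultRole"):
    """Build the keyword arguments for EMR's run_job_flow call.

    instance_type, release_label and service_role are assumed defaults.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,          # 1 master + (node_count - 1) core nodes
            "Ec2KeyName": key_name,               # the key pair created earlier
            "KeepJobFlowAliveWhenNoSteps": True,  # stay up until terminated by hand
        },
        "JobFlowRole": instance_profile,          # the EC2 instance profile
        "ServiceRole": service_role,
    }


def launch_cluster(emr_client, config):
    """Start the cluster and return its id (looks like 'j-...')."""
    return emr_client.run_job_flow(**config)["JobFlowId"]


# With real credentials (all names are placeholders):
#   import boto3
#   config = cluster_config("global-warming", "emr-demo-key", "emr-demo-role")
#   cluster_id = launch_cluster(boto3.client("emr"), config)
```

Keeping the configuration in a plain dictionary makes it easy to tweak the node count later without touching the launch code.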



Creating a step
  • Steps are basically jobs you configure to run. There are different kinds of steps depending on what you want to do and which tool you want to use. We are going to use a MapReduce streaming step: since our scripts are in Python, we need the Hadoop streaming API.
  • The process is straightforward: attach the mapper and reducer files from an S3 bucket along with the input data, then specify the output location and leave the rest at the defaults.
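The step described above can be sketched as a boto3 step definition. On EMR, a streaming job is submitted through `command-runner.jar` with `hadoop-streaming` as the first argument. The bucket and file names below are placeholders, not the ones used for this post.

```python
def streaming_step(name, mapper_uri, reducer_uri, input_uri, output_uri):
    """Describe a Hadoop streaming step for EMR's add_job_flow_steps call."""
    # -files ships the scripts to every node; once there, they are
    # referenced by their bare file names.
    mapper_file = mapper_uri.rsplit("/", 1)[-1]
    reducer_file = reducer_uri.rsplit("/", 1)[-1]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", f"{mapper_uri},{reducer_uri}",
                "-mapper", mapper_file,
                "-reducer", reducer_file,
                "-input", input_uri,
                "-output", output_uri,
            ],
        },
    }


def submit_step(emr_client, cluster_id, step):
    """Add the step to a running cluster and return the new step id."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Placeholder S3 locations, for illustration only.
step = streaming_step(
    "global-warming-streaming",
    "s3://my-bucket/code/mapper.py",
    "s3://my-bucket/code/reducer.py",
    "s3://my-bucket/input/",
    "s3://my-bucket/output/run-1/",
)
print(step["HadoopJarStep"]["Args"])
```

One thing to watch: Hadoop refuses to write into an output location that already exists, so use a fresh output prefix for each run.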

NB: Keep in mind that I'm persisting the data in S3 because I terminate the EMR cluster once the step is done, and I can't copy gigabytes of data from S3 to HDFS every time I want to run a MapReduce job.
 
Results
 
 


As expected, our script did better on the cluster than the sequential run: 14 min versus 27 min previously, so roughly twice as fast. It's not lightning speed, though; for that we would need more nodes, and more data to keep them busy. That's where MapReduce performs well.
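To check the results without opening the S3 console, something like the following could work. It is a sketch that assumes the usual Hadoop output layout (`part-*` files plus a `_SUCCESS` marker) under the output prefix; the bucket and prefix are placeholders, and a single `list_objects_v2` call only returns the first 1000 keys, which is plenty here.

```python
def read_job_output(s3_client, bucket, prefix):
    """Concatenate the lines of every part-* file under the output prefix."""
    listing = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    lines = []
    for item in sorted(listing.get("Contents", []), key=lambda obj: obj["Key"]):
        filename = item["Key"].rsplit("/", 1)[-1]
        if not filename.startswith("part-"):
            continue  # skip markers such as _SUCCESS
        body = s3_client.get_object(Bucket=bucket, Key=item["Key"])["Body"].read()
        lines.extend(body.decode("utf-8").splitlines())
    return lines


# With real credentials (bucket and prefix are placeholders):
#   import boto3
#   for line in read_job_output(boto3.client("s3"), "my-bucket", "output/run-1/"):
#       print(line)
```

Each reducer writes its own part file, so the number of files depends on how many reducers ran.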

Most importantly, don't forget to terminate your cluster and check that no EC2 instances are still running, unless you want to go bankrupt.
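If you tend to forget, a small clean-up helper along these lines might save you some money. It is a sketch, not a safety net: it assumes boto3 with configured credentials, only looks at the first page of results, and terminates every active EMR cluster in the current region, so don't run it on an account shared with other people's clusters.

```python
# EMR cluster states that still cost money.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def terminate_active_clusters(emr_client):
    """Terminate every active EMR cluster and return the ids that were targeted."""
    clusters = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)["Clusters"]
    cluster_ids = [cluster["Id"] for cluster in clusters]
    if cluster_ids:
        emr_client.terminate_job_flows(JobFlowIds=cluster_ids)
    return cluster_ids


def running_instance_ids(ec2_client):
    """List the ids of EC2 instances that are still in the 'running' state."""
    response = ec2_client.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


# With real credentials:
#   import boto3
#   print(terminate_active_clusters(boto3.client("emr")))
#   print(running_instance_ids(boto3.client("ec2")))  # ideally an empty list
```

Termination takes a few minutes, so instances may still show up as running right after the call.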

You see, setting up a cluster in AWS is really child's play. In the next post we'll adapt the script to Spark and run it on AWS.

How excited are you?

Leave comments and questions below, see you in the next post!
