Anyone interested in Big Data is familiar with the term Hadoop. There is a lot you can do with Hadoop, but many myths about it are floating around, and my team and I worked on one of them. We are simply trying to provide the right meaning, i.e., the right "arth" (Hindi for "meaning"). We did this task under the mentorship of @vimaldaga.
MYTH:
In a Hadoop cluster, data blocks are sent to the datanodes in parallel.
Let's start and see what we find out!
First, here is the infrastructure of our Hadoop cluster.
We launched a Hadoop cluster in which 5 datanodes are connected to a namenode, and a client is connected to the namenode. The public and private IPs of all the machines are shown in the image.
The client uploads a file to the cluster. The file is approximately 2.3 MB, which is distributed into 3 blocks of 1 MB each (the default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x+, but the client reduced the block size via a command-line option).
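As a minimal sketch of how such an override can be done (the file and HDFS paths here are placeholders; the property is dfs.block.size in Hadoop 1.x and dfs.blocksize in Hadoop 2.x+):

# Upload a file with a 1 MB block size instead of the default.
# /root/testfile and /input are hypothetical paths.
hadoop fs -D dfs.block.size=1048576 -put /root/testfile /input/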
The 3 blocks were uploaded to 3 machines, and every block has 3 replicas. Now we analyze how data transfer happens on the cluster. For this we used the tcpdump command with port number 50010.
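One way to confirm the block count and the replica locations is hadoop fsck (the HDFS path is a placeholder):

# List the file's blocks, their replicas, and which datanodes hold them.
hadoop fsck /input/testfile -files -blocks -locations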
We ran the tcpdump command on all the machines and stored the logs in files.
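The capture command we mean looks like this (the interface name and output file are placeholders):

# Capture all traffic on the datanode data-transfer port (50010)
# and write the raw packets to a file for later analysis.
tcpdump -i eth0 -w client.pcap port 50010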
One thing we found out immediately: the client uploads the file directly to the datanodes; the master (namenode) is not the one that uploads it. We discovered this because when we ran tcpdump on the master with port number 50010, the master did not receive any packets. Further analysis with tcpdump showed that the master just provides the IPs of the datanodes to the client, and the client then uploads the data directly to those datanodes. We prove this below; just follow the full blog.
Let's analyze the images:
We stored the client's tcpdump logs in the file client.pcap. Reading the logs, we see our client (private IP 192.168.42.232) exchanging packets with a datanode (public IP 15.206.80.245) at 12:10:51, i.e., the client uploading the 1st block to that datanode. Going further, we see that until 12:10:54 our client communicates only with datanode 15.206.80.245. You can also relate this to the next image.
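Reading a stored capture back is simply (-n prints raw IPs instead of resolving hostnames; the file name matches the capture above):

tcpdump -n -r client.pcap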
Look at the image: before the marked IPs you can see the client (private IP 192.168.42.232) still communicating with the datanode (public IP 15.206.80.245) up to 12:10:54. At the marked IPs, the client starts exchanging packets with a different datanode (public IP 3.95.162.78), i.e., the client begins uploading the 2nd block to datanode 3.95.162.78. From then on, the client communicates only with datanode 3.95.162.78 until 12:10:59; the next image gives a clear picture of what is going on.
Before the marked line, the client (private IP 192.168.42.232) communicates with the datanode (public IP 3.95.162.78). At 12:10:59 the client starts exchanging packets with the datanode with public IP 13.233.1.96, i.e., the client now uploads the 3rd block to that datanode. This continues until 12:11:01.
Looking at the image, we get a clear picture: the client (private IP 192.168.42.232) communicates with the datanode (public IP 13.233.1.96) until 12:11:01, and then the tcpdump capture terminates.
For further verification, we ran tcpdump in another terminal with the filter "not port 50010" (i.e., this time packets on port 50010 are excluded from the client's capture).
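The negated capture looks like this (interface and file name are placeholders):

# Capture everything EXCEPT the data-transfer port, to confirm
# that the block traffic really does travel over port 50010.
tcpdump -i eth0 -w client_not_50010.pcap not port 50010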
You can see in the first image that from 12:10:51 to 12:10:54 the block-transfer packets do not appear in the client's capture (they were filtered out), and we know that during this slot the first block was being uploaded to a datanode. The same thing happens in the next image in the 12:10:55 to 12:10:59 slot, when the 2nd block was uploaded to another datanode, and likewise from 12:10:59 to 12:11:01, when the 3rd block was uploaded to yet another datanode.
CONCLUSION:
We see that the 3 blocks are uploaded to 3 different datanodes in 3 different, non-overlapping time slots. This proves that the blocks are uploaded to the datanodes serially, not in parallel.
As we further analyzed the tcpdump logs of all the datanodes, we saw that the replicas of a block are created on the other datanodes simultaneously, i.e., as the client starts uploading a block to a datanode, that datanode starts making a replica on another datanode at the same time.
Let's prove it!
As we saw above, the client (private IP 192.168.42.232) uploads the 1st block of the file to the datanode (public IP 15.206.80.245), so first we analyze the tcpdump logs of datanode 15.206.80.245.
In the above image, at 12:10:51, the client (public IP 49.35.103.167) communicates with the datanode (private IP 172.31.44.222, public IP 15.206.80.245), and at the same time, in the 2nd highlighted line, that datanode communicates with another datanode (public IP 3.95.162.78). This proves that the replica is created at the same time as the block is uploaded to the datanode by the client. For further verification we read the tcpdump logs of the datanode on which the 2nd replica is created (public IP 3.95.162.78).
In the above image, the datanode (private IP 172.31.44.222, public IP 15.206.80.245) communicates with the datanode (public IP 3.95.162.78, private IP 172.31.83.185) at 12:10:51, and at the same time that datanode communicates with another datanode (public IP 3.90.160.237), i.e., the 3rd replica is also being created at the same time, 12:10:51.
CONCLUSION:
Following the same process, we read all the tcpdump log files, and my team and I concluded that the blocks are uploaded to the datanodes serially, but the replicas of each block are created simultaneously with the upload of that block: as a datanode receives data from the client, it forwards it to the next datanode in the chain. This matches what the HDFS documentation calls replication pipelining.
What is the tcpdump command?
tcpdump is a powerful and widely used command-line packet sniffer and packet analyzer, used to capture and filter TCP/IP packets received or transmitted over a network on a specific interface. It is available on most Linux/Unix-based operating systems.
tcpdump -i <network_interface> port <port_number>
For capturing tcpdump logs, tcpdump provides many options, such as -w for writing the capture to a file and -r for reading it back.
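For example (interface and file names are placeholders; note that a filter can also be applied when reading a saved file):

# Write everything to a file, then read it back with a filter.
tcpdump -i eth0 -w all.pcap
tcpdump -n -r all.pcap port 50010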
Hadoop exposes many well-known default ports; port 50010 is the datanode's data-transfer port (in Hadoop 1.x/2.x), which is why it is the one to capture for data-transfer packets.
My team and I express our gratitude to our mentor @vimaldaga. We enjoyed this task; we hope you love it.
THANKS FOR READING!!!
Visit the GitHub repo for the log files.