Monday, October 14, 2013

Best Way of Running Parallel Wgets


What is the best way of running parallel wgets? So far I've discovered two methods (are there more?), but I'm unsure of the pros and cons of each. 

method 1: 

urllist.txt 
--http-user=user1 --http-password=pass1 -O file1 https://site1.com 
--http-user=user2 --http-password=pass2 -O file2 https://site2.com 
--http-user=user3 --http-password=pass3 -O file3 https://site3.com 
--http-user=user4 --http-password=pass4 -O file4 https://site4.com 
--http-user=user5 --http-password=pass5 -O file5 https://site5.com 
--http-user=user6 --http-password=pass6 -O file6 https://site6.com 


#!/bin/sh 
# Run one wget per line of urllist.txt, up to 80 in parallel.
# (-L 1 keeps each line's options together as one command;
# echoing the file through xargs -n 1 would split on every word.)
xargs -L 1 -P 80 wget -q < urllist.txt 
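To see what xargs is doing here, a tiny sketch (using echo as a stand-in for wget, with three hypothetical entries instead of urllist.txt) shows one command per input line, capped at two concurrent jobs:

```shell
#!/bin/sh
# -I {} makes xargs run one command per input line;
# -P 2 caps the number of simultaneous processes at two,
# queuing the rest until a slot frees up.
printf 'site1\nsite2\nsite3\n' | xargs -P 2 -I {} sh -c 'echo fetched {}'
```

With -P 2 the completion order can vary, which is exactly the point: the jobs overlap, but never more than two at once.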


method 2: 

urllist.txt 
wget --http-user=user1 --http-password=pass1 -O file1 https://site1.com 
wget --http-user=user2 --http-password=pass2 -O file2 https://site2.com 
wget --http-user=user3 --http-password=pass3 -O file3 https://site3.com 
wget --http-user=user4 --http-password=pass4 -O file4 https://site4.com 
wget --http-user=user5 --http-password=pass5 -O file5 https://site5.com 
wget --http-user=user6 --http-password=pass6 -O file6 https://site6.com 


#!/bin/sh 
# Launch every line of urllist.txt as its own background job,
# all at once, with no limit on concurrency.
while read line 
do 
    eval "(${line}) &" 
done < urllist.txt 
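One refinement worth noting for method 2: as written, the script exits as soon as the loop finishes, while the downloads are still running in the background. A `wait` at the end blocks until every job has completed, which matters if anything later depends on the files being present. A minimal sketch:

```shell
#!/bin/sh
# Same loop as method 2, but block until all background
# wgets have finished before the script exits.
while read line
do
    eval "(${line}) &"
done < urllist.txt
wait   # returns once every backgrounded job has completed
```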


Does anyone have an opinion on which is the best method (speed, resources, contention, etc)?

My testing shows that for a large number of requests, the first method, using xargs, is better. 
It lets you set the number of parallel processes that run at any given time (I wouldn't use the 80 you've specified; I'd use maybe 20), queuing the rest. 
While a wget is waiting on an HTTP response, the operating system can de-schedule it until more input becomes available, so it doesn't eat up CPU. 
This stops your system being overloaded, and potentially thrashing. 
The second method just launches them all at once, load be damned. 
If I were worried about load, memory, and making the system behave, the xargs method plays much nicer. Better yet, put nice in front of each wget to be really nice when you launch a lot of processes.
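Combining both suggestions is a one-line change to the method-1 script: insert nice before wget so the whole batch runs at lower CPU priority (a sketch, assuming the method-1 urllist.txt format):

```shell
#!/bin/sh
# nice lowers each wget's scheduling priority, so a batch of
# 20 concurrent downloads yields CPU to interactive work.
xargs -L 1 -P 20 nice wget -q < urllist.txt
```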
Note that both methods download in parallel: with -P, xargs caps how many run simultaneously, while method two launches everything at once with no limit.

