Python crawler controller

I work on a project for which I have written 20+ crawlers, and they run 24/7 (with a good amount of sleep). Sometimes I need to update or restart the server, and then I have to start all the crawlers again. So I have written a script that controls all of them. It first checks whether a crawler is already running; if not, it starts the crawler, which then runs in the background. It also saves the pid of every crawler in a text file so that I can kill a particular crawler immediately when needed.

Here is my code:

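from subprocess import Popen, PIPE
import shlex

# Each entry maps a site name to [crawler script file name, log file name].
# There is one entry per crawler.
site_dt = {'Site1 Name': ['', 'site1_crawler.out'],
           }

location = "/home/crawler/"

pidfp = open('pid.txt', 'w')

def is_running(pname):
    # Pipe "ps ax" into "grep pname" and look for the crawler's full path.
    p1 = Popen(["ps", "ax"], stdout=PIPE)
    p2 = Popen(["grep", pname], stdin=p1.stdout, stdout=PIPE)
    p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits.
    output = p2.communicate()[0].decode()
    return (location + pname) in output

def main():
    for item in site_dt.keys():
        print(item)
        if is_running(site_dt[item][0]):
            print(site_dt[item][0], "already running")
            continue  # Do not start a second copy of this crawler.
        cmd = "python " + location + site_dt[item][0] + " -l info"
        outfile = "log/" + site_dt[item][1]
        fp = open(outfile, 'w')

        # Popen returns immediately, so the crawler keeps running in the background.
        pid = Popen(shlex.split(cmd), stdout=fp).pid

        print(pid)
        # pid is an int, so convert it before writing one "name: pid" line.
        pidfp.write(item + ": " + str(pid) + "\n")
    pidfp.close()


if __name__ == "__main__":
    main()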
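
The pid file is what makes it possible to stop one crawler without touching the others. Below is a minimal sketch of how that could look, assuming Python 3 and assuming pid.txt holds one "Site Name: pid" line per crawler, which is the format the controller is meant to write. The helper names read_pids and kill_crawler are hypothetical and not part of the script above.

```python
import os
import signal


def read_pids(path="pid.txt"):
    """Parse lines of the form 'Site Name: 12345' into a {name: pid} dict."""
    pids = {}
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            # Split on the last colon so that site names may contain colons.
            name, _, pid = line.rpartition(":")
            pids[name.strip()] = int(pid)
    return pids


def kill_crawler(name, path="pid.txt", sig=signal.SIGTERM):
    """Send `sig` to the crawler recorded under `name`; return True if sent."""
    pid = read_pids(path).get(name)
    if pid is None:
        # No crawler with this name was recorded in the pid file.
        return False
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # The process has already exited, so the recorded pid is stale.
        return False
    return True
```

Calling kill_crawler("Site1 Name") would then send SIGTERM to that one crawler. Note that a pid read from a file can be stale after a server restart, so it is worth checking that the process is still the crawler before relying on this.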

If you feel that there is scope for improvement, please comment.
