Python Grid Engine

29th Jan 2009 python scad bash grid engine ip scanner platform independent port scanner renderfarm ssh

I have been working on this grid engine script for some time now.

It's based on Python and uses as few third-party libraries as possible to keep it platform independent. It works by connecting to each computer through SSH and sending command line renders, or any other job that can be distributed over the SSH command line.

I know there are a lot of renderfarm scripts out there, but they are all the same in principle: the user needs to output renderer information files such as ifd, rib, or mi2 on a workstation before sending the job to the farm. This takes a huge amount of time and is very inefficient, since all these files have to be distributed to the render nodes before the farm can even start rendering.

Now imagine doing this with a huge simulation that's cached to the hard drive. I usually end up with cache folders over 10 gigs, and it's impossible to send that to a renderfarm. My code deals with all these inefficiencies by storing everything on a network share, connecting to the render nodes through SSH, and starting the render from the shell of the renderer, which avoids creating and storing the renderer information files frame by frame.

The following code snippet starts a child process and executes the SSH client, connects to the computer and sends in as many commands as needed:

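import os

def remotehoudinicmd(user, remotehost, cmd1, cmd2, cmd3):
    # Fork a child attached to a new pseudo-terminal (Unix only).
    pid, fd = os.forkpty()
    if pid == 0:
        # Child: replace this process with the SSH client; cmd1..cmd3 are lists of strings.
        os.execv("/usr/bin/ssh", ["/usr/bin/ssh", "-l", user, remotehost] + cmd1 + cmd2 + cmd3)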
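
For comparison, the same idea can be sketched with the standard subprocess module instead of os.forkpty. This is only an illustration, not the script's actual code: the helper names build_ssh_argv and run_remote are made up here, and chaining the commands with " && " is an assumption about how they should run on the remote shell.

```python
import subprocess

def build_ssh_argv(user, remotehost, commands):
    """Return the argument list for one ssh call that runs `commands` in order.

    Hypothetical helper: the commands are joined with ' && ' so the remote
    shell stops at the first one that fails.
    """
    remote_command = " && ".join(commands)
    return ["/usr/bin/ssh", "-l", user, remotehost, remote_command]

def run_remote(user, remotehost, commands):
    """Run the commands on the remote host and return (exit code, output)."""
    proc = subprocess.Popen(
        build_ssh_argv(user, remotehost, commands),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    output = proc.communicate()[0]
    return proc.returncode, output
```

Having the exit code and the output back in the parent process is one possible starting point for error reporting, since a failed command on the node shows up as a non-zero return code.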

And this part of the code uses the procedure above to fill in the variables to source a Houdini shell and start a render through the hrender command:

# startframe, endframe, rendernode and filepath are strings set elsewhere in the script.
houdinilinuxlocation = "cd /opt/hfs9.5.303" + "\n"
houdinisourcefile = "source houdini_setup" + "\n"
houdinirendercmd = "hrender -e -f " + startframe + " " + endframe + " -v -d " + rendernode + " " + filepath
print("command: " + houdinirendercmd)
remotehoudinicmd(username, remotehostip, [houdinilinuxlocation], [houdinisourcefile], [houdinirendercmd])

Right now I'm working on error reporting and on running the process that connects to the render nodes and starts the renders locally. I will also write a port scanner, or use a third-party one, to find SSH servers running on the local subnet.

There are other things on the to-do list as well, such as storing the number of processors and the amount of RAM each render node has, keyed by its MAC address, and using this list to send fewer or more frames to different nodes so the render job runs as efficiently as possible. I also want a procedure that collects error data from the jobs and, where possible, re-assigns just the failed part of a job, or otherwise the whole chunk, to another node or to the same one.
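
To make the frame distribution idea a bit more concrete, here is a rough sketch of how a frame range could be divided in proportion to each node's capacity. None of this exists in the script yet: split_frames and the node names are hypothetical, and the weights are assumed to be positive integers such as processor counts.

```python
def split_frames(startframe, endframe, weights):
    """Split the inclusive frame range into one contiguous chunk per node.

    `weights` maps a node name to a relative capacity (for example its
    processor count). Returns {node: (first, last)}; nodes that end up
    with no frames are left out.
    """
    total_frames = endframe - startframe + 1
    total_weight = sum(weights.values())
    chunks = {}
    first = startframe
    assigned_weight = 0
    for node in sorted(weights):
        assigned_weight += weights[node]
        # Working from the cumulative share keeps rounding errors from piling up.
        last = startframe + (total_frames * assigned_weight) // total_weight - 1
        if last >= first:
            chunks[node] = (first, last)
            first = last + 1
    return chunks

# An 8-core node gets four times as many frames as a 2-core node.
print(split_frames(1, 100, {"node-a": 8, "node-b": 2}))
```

Because each chunk is a plain (first, last) pair, a failed chunk could be handed to another node simply by calling the render command again with the same two frame numbers.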

I am not releasing the code for this yet, as it is not even in alpha status. Updates to this code will be edited into this entry rather than posted as new entries.

UPDATE

The IP scanner is in working condition. One little problem is that the computers within the given IP range have to be online; otherwise the script takes forever to run. I'll try to add multithreading to this procedure to make it faster.

What the script does is simple: it takes two IP addresses, unpacks each one into a tuple, and takes the first value of the tuple, which is the address as an integer, so it can be incremented in a loop. The loop packs the integer back into its x.x.x.x form and connects to it on port 22, which is the SSH port.
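
The packing and unpacking can be isolated into a few small helpers. The function names below are made up for this sketch; the struct and socket calls are the same ones the script uses.

```python
import socket
import struct

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to an unsigned 32-bit integer."""
    return struct.unpack('!I', socket.inet_aton(ip))[0]

def int_to_ip(number):
    """Convert an unsigned 32-bit integer back to a dotted-quad string."""
    return socket.inet_ntoa(struct.pack('!I', number))

def ip_range(start, stop):
    """Yield every address from start to stop, both included."""
    for number in range(ip_to_int(start), ip_to_int(stop) + 1):
        yield int_to_ip(number)

# Incrementing the integer rolls over into the next octet automatically.
print(list(ip_range("10.0.0.254", "10.0.1.1")))
```

Working on the integer form means a range that crosses an octet boundary needs no special handling.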

The loop uses the connect_ex function, which attempts the connection but returns an error code instead of raising an exception if it can't connect. If it can connect, it returns zero. When it returns zero, the IP address gets written to a file to be read later on by the distribution procedure, and further down the line by another SSH procedure that records the configuration of the computer, to be used by the distribution procedure again. Once it's done with the socket, it closes it, unpacks the IP, takes the integer out of the tuple, increments it and loops. Here is a snippet of the procedure:

def portscan(start, stop):
    import struct
    import socket
    # Convert the dotted-quad addresses to integers so they can be incremented.
    unpackedstart = struct.unpack('!I', socket.inet_aton(start))[0]
    unpackedstop = struct.unpack('!I', socket.inet_aton(stop))[0]
    while unpackedstart <= unpackedstop:
        ip = socket.inet_ntoa(struct.pack('!I', unpackedstart))
        socketobj = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # connect_ex returns 0 on success and an error code otherwise.
        result = socketobj.connect_ex((ip, 22))
        socketobj.close()
        if result == 0:
            # Record the reachable node for the distribution procedure.
            databasefile = open('/database', 'a')
            databasefile.write(ip + "\n")
            databasefile.close()
            print(ip)
        unpackedstart = unpackedstart + 1
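
The long wait on offline machines most likely comes from the socket blocking until the operating system gives up on the connection. One way around that, sketched below, is to set a timeout on the socket before calling connect_ex. The name is_port_open and the half-second default are assumptions made for this example.

```python
import socket

def is_port_open(host, port=22, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an errno value on failure.
        return sock.connect_ex((host, port)) == 0
    except socket.error:
        # Name resolution problems are raised rather than returned.
        return False
    finally:
        sock.close()
```

With a timeout in place, an unreachable address costs at most the timeout value instead of stalling the whole scan.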

UPDATE

Time for an update, a major one. I figured out threading in Python, but as always I hit a brick wall as soon as I worked my way around the threading problem. I am not sure if it was Python or the OS, but apparently one of them is not too keen on opening the same file a few hundred times at once, so checking a large number of IP addresses at the same time was impossible. I decided to run the IP checks in batches of 50 instead: the Python script spawns 50 threads to check the network nodes, pauses for half a second, then spawns another 50 threads. I opted for the pause instead of waiting for a thread to end and spawning a new one in its place, because it does almost the same job with a lot less code, and when your program breaks and you need to debug, you always want less, simpler code. Here is the newest port scanning part:

import socket
from subprocess import Popen, PIPE
from threading import Thread

class portscan(Thread):
    def __init__(self, ip, databasefile):
        Thread.__init__(self)
        self.ip = ip
        self.databasefile = databasefile
        self.status = -1
    def run(self):
        database = open(self.databasefile, 'a')
        socketobj = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Give up quickly on hosts that do not answer.
        socketobj.settimeout(.1)
        result = socketobj.connect_ex((self.ip, 22))
        socketobj.close()
        if result == 0:
            # Ping the host once and trim the output down to the round-trip time.
            # The value is left on arg4.stdout and is not stored yet.
            arg1 = Popen(["ping", "-c 1", "-t 1", self.ip], stdout=PIPE)
            arg2 = Popen(["grep", "time"], stdin=arg1.stdout, stdout=PIPE)
            arg3 = Popen(["cut", "-d", "=", "-f4-"], stdin=arg2.stdout, stdout=PIPE)
            arg4 = Popen(["sed", r"s/.\{3\}$//"], stdin=arg3.stdout, stdout=PIPE)
            entry = self.ip + "\n"
            database.write(entry)
            print(str(self.ip) + ": OPEN")
        database.close()

And the part of the Python script that loops over it is here:

import os, socket, struct, sys, time

print("Enter IPs to start and stop scanning:")
startip = raw_input("Start IP: ")  # Python 2; use input() on Python 3
stopip = raw_input("Stop IP: ")

databasefile = os.path.join(sys.path[0], "nodeDB")
print(databasefile)
# Start from an empty node database on every run.
if os.path.exists(databasefile):
    os.remove(databasefile)
unpackedstart = struct.unpack('!I', socket.inet_aton(startip))[0]
unpackedstop = struct.unpack('!I', socket.inet_aton(stopip))[0]

batchsize = 50
batchcounter = 0
batchstart = unpackedstart
# The range is half-open: the stop IP itself is not scanned.
while batchstart < unpackedstop:
    print("batchcounter " + str(batchcounter))
    # Scan at most batchsize addresses per batch, never past the stop IP.
    batchstop = min(batchstart + batchsize, unpackedstop)
    while batchstart < batchstop:
        checkip = socket.inet_ntoa(struct.pack('!I', batchstart))
        threadcreate = portscan(checkip, databasefile)
        threadcreate.start()
        batchstart = batchstart + 1
    batchcounter = batchcounter + 1
    # Give the current batch half a second before spawning the next one.
    time.sleep(0.5)
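
As an alternative to pausing between batches, the number of simultaneous checks can be capped with a fixed-size thread pool, with the results written to the file once from a single thread. The sketch below is not the approach used above: it relies on concurrent.futures, which is in the standard library from Python 3.2 onward, and scan_hosts, write_nodes and the check argument are names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_hosts(hosts, check, max_workers=50):
    """Run check(host) for every host with at most max_workers threads.

    Returns the hosts for which check returned a true value, in input order.
    """
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, hosts))
    return [host for host, is_open in zip(hosts, results) if is_open]

def write_nodes(path, hosts):
    """Write one host per line from a single thread after the scan."""
    with open(path, "w") as database:
        for host in hosts:
            database.write(host + "\n")
```

Here check would be any function that takes an IP string and returns True when port 22 answers. Since only one file handle is ever opened, the problem of opening the same file hundreds of times should not come up, and the pool starts a new check as soon as a worker is free instead of waiting a fixed half second.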

I also have some code to run a threaded version of the previous Houdini render command, but my new roadblock is SSH authentication. Once I work around the SSH problems, the code should be ready to test. The only things left on the to-do list will be queue management, error checking, and adding a few more presets for other command line renderers.
