Balamurugan V Raman: Net Connections Puzzle

A short story:

This problem (given below) was available in gild.com and I started solving this roughly before an year. Couldn't think of anything other then graphs to represent networks and whatever solution I tried failed to scale for larger networks. This is primarily because there is nothing inherent in this problem except for the connectivity between the nodes ( But failed to notice it :) ).

Recently, when I revisited the problem again, something striked in my mind that it is just the connectivity between the nodes and not the structure of the graph. So, it essentially boils down to maintaining sets (essentially disjoint) of connected nodes and then the test for connectivity between any two nodes is just a test for membership of both the nodes in a set. Felt really great and coded this quickly to find that this also fails to scale well !!!

But I was really motivated by the very thought that the sets must be ideal solution to this problem then modelling as graph. So, started looking out for efficient algorithms involving applications of sets for connectivity tests and came across this link. And ... it is done ! It is an improvement over naive (!) set membership algorithm which builds trees of nodes to enable quick union and quick find. Uses path compression logic to keep these trees flat failing to do will lead to tall trees and hampering the performance.

Moral of the story (There should be one right !) :
1) There can be something which is really efficient and captures the essence of the problem rather than the classical representation
2) Trees can be represented in arrays very efficiently(This is a moral for me !!).

Now, the problem ...

You are a network administrator sat in front of your network console. From a log file you can see how your network change and how computers are connected together. You have to decide, quickly, whether two given computers are connected, directly or indirectly, according to the log information.

Each connection is bi-directional. A computer is connected to the computer B if they are directly connected or if there is a path of directly connected computers between the computer A and B.

For instance, let’s say a computer A is connected to the computer B while the computer C is connected with the computer B. Then C and A are also connected (there’s a path of directly connected computers between A and C). Following the previous example, if computer D is now connected with the computer A, then D is now connected to B and C too.

Write a program which reads connection information from a text log file and counts the number of successful and unsuccessful answers to questions of the kind :

is computer A interconnected with computer B ?
Input specifications

Your program will be executed passing, as first parameter of the command line, the name of the file that the program has to open and read.

The very first line of the input file is a positive integer number followed by a new line “\n”. This number indicates how many computers are present into the network.

The file will continue with lines representing either connection information or queries. Connection information lines tell you new information about your network, while Queries are requesting you to answer a question.
Connection lines and Query lines are identified by the first letter of each line (a Connection starts with ‘c’, a Query starts with ‘q’).
Connection line

The connection line is in the form

c N Z

c is the ascii character ‘c’ (indicates a connection). N and Z are two positive integers from 1 to the number of computers declared in the first line of the file. This line means the computer N is now connected to the computer Z and vice versa.
Query line

Queries are in the form

q P Q

q is the ascii character ‘q’ (indicates a query). N and Z are two positive integers from 1 to the number of computers declared in the first line of the file. This line is asking you if the computer P is currently connected to the computer Q. You have to answer this question checking whether the query, at the time when it is encountered, it is successful or not.

Each data set (connection and query) is on a separate line. Connections and queries can appear in any order, regardless of their type. The network state is updated after each line of type “Connection”; each line of type “Query” is processed according to the current network configuration.

Example 1 of input file:

10
c 1 2
q 2 5
c 3 9
q 6 7
c 2 6
q 4 5
c 1 5
c 8 7
c 2 6
c 6 3
c 4 8
c 8 1
c 3 9
c 1 2
q 4 3
c 8 4
c 4 5
q 7 9
c 8 5
c 8 2

Example 2 of input file:

5
c 2 1
c 4 2
c 4 3
c 2 1
c 3 4
c 4 3
q 4 3
c 1 3
c 1 4

Output specifications
============================
Your program must outputs the sum of the connections found (the sum of positive answers to all queries processed) and the sum of the connections not found (the sum of negative answers to all queries processed). These values are separated by a comma (“,”) and followed by a single new line (“\n”).

Output file for example 1:

2,3

Output file for example 2:

1,0

#! /usr/bin/env python

# netconnections

import time
import os
import copy
import numpy as np # arrays are used

# Connections between any two nodes is maintained as
# a map here. If node_1 and node_2 are connected,
# _lookup_cache[(node_1, node_2)] = 1
_lookup_cache = {}

# graph of the network .. stored as adjacency list 
graph = {}

os.chdir(\'C:\\\\XYZ\\\\Desktop\\\\gild\\\\netconnections\')

# look ups count
succ, fail = 0,0

# cache efficiency
hits, misses = 0,0

def readFile(filename):
    s=time.clock()
    for _line in file.readlines(open(filename, \'r\')):
        pass
    t=time.clock()-s
    print \'Time in (sec) to just read the file: \', t


## BFS based lookup - this doesn\'t scale well when the network grows large
## with cache to keep track of graphs visited already    

def isInCache(x,y):
    global hits, misses
    if (x,y) in _lookup_cache or (y,x) in _lookup_cache:
        hits = hits+1
        return True
    else:
        misses = misses+1
        return False
     
def updateLookUpCross(a,b):
    """
    Take the cross product of the two lists processed
    and update the lookup cache
    """
    for t in [(x,y) for x in a for y in b]:
        _lookup_cache[t] = 1

def search(x,y,v,s):
    if x not in graph or y not in graph: return False
    if isInCache(x,y): return True
    global _lookup_cache
    if v is None: v = {}
    while len(s) > 0:
        q=s.pop(0)
        if isInCache(q,y): return True # cache
        v[q]=1 # visit
        if y in graph[q]:
            # update cache for nodes visited so far and
            # elements of graph, becoz all are
            updateLookUpCross(v.keys(), graph[q])
            _lookup_cache[(x,y)]=1
            if q != x: # intermediate path
                _lookup_cache[(x,q)]=1
                _lookup_cache[(q,y)]=1
            return True # done
        for e in graph[q]: # visit children
            # add to stack if not visited already
            if e not in v: s.append(e)
    return False

def processFile(filename, timeit=False):
    global graph
    global _lookup_cache
    global succ, fail
    succ = fail = 0
    try:
        ps = time.clock()
        times = []
        f = open(filename, \'r\')
        maxComps = int(f.readline())
        for line in f.readlines():
            a,b,c = line.split()
            if a == \'c\': # new connection
                #_lookup_cache[(b,c)], _lookup_cache[(c,b)] = 1, 1
                try:
                   if c not in graph[b]: graph[b].append(c)
                except KeyError:
                   graph[b] = [c]

                try:
                   if b not in graph[c]: graph[c].append(b)
                except KeyError:
                   graph[c] = [b]
            else: # query
                s = time.clock()
                vis, stk = None, [b]
                if search(b,c,vis,stk):
                    succ = succ+1
                else:
                    fail=fail+1
                t = time.clock() - s
                times.append(t)
        timetaken = time.clock()-ps
        print \'Using graphs: Whole process took(secs): \', timetaken
    except IOError:
        exit(\'File open failed\')

##actually a simple logic without involving any graph can
##be used to solve this problem.
##Just maintain all the connected nodes in a set
##which will result in disjoint sets for non-connected networks.
##When there is a connection between nodes in two different networks
##merge both the corresponding sets.
##Querying for the presence of a connection is then a matter of looking
##for membership of ***BOTH*** the nodes within a set

## Literature calls this algorithm as UNION-FIND - I din\'t know when I developed it
## probably my first algorithm :)         

## Unfortunately this also doesn\'t scale well ! because every membership search in sets is O(n)
## and this involves extra overhead of adding to a list, removing from list etc.,        

# Collection of sets        
networkNodes = None        
def addConnection(a,b):
    """
    1. For every set maintained, check if both a,b are in a set
    2. Return if 1 is True
    3. Check if a is in any set. Take that set. Else create a set with \'a\'
    4. Check if b is in any set. Take that set. Else create a set with \'b\'
    5. Combine the sets created in 3 & 4.
    """
    global networkNodes
    if networkNodes is None: networkNodes = []
    iCA, iCB = -1, -1 # cluster indices
    for i, c in enumerate(networkNodes):
        if a in c and b in c:
            return
        if a in c:
            iCA = i
            continue
        if b in c:
            iCB = i
            continue

    if iCA is -1:
        networkNodes.append(set([a]))
        iCA = len(networkNodes) - 1
    if iCB is -1:
        networkNodes.append(set([b]))
        iCB = len(networkNodes) - 1

    networkNodes[iCA] = networkNodes[iCA] | networkNodes[iCB] # set union
    networkNodes.pop(iCB)

def queryConnection(a,b):
    """
    1. For every set maintained, check if both a,b are in a set
    2. Return True if 1 is True
    3. Return False if list is exhausted
    """
    global networkNodes
    global succ, fail
    for cluster in networkNodes:
        if a in cluster and b in cluster:
            succ=succ+1
            return True
    fail=fail+1
    return False
    
def processFile1(filename):
    global graph
    global _lookup_cache
    global succ, fail
    succ = fail = 0
    try:
        ps = time.clock()
        times = []
        f = open(filename, \'r\')
        maxComps = int(f.readline())
        for line in f.readlines():
            a,b,c = line.split()
            if a == \'c\': # new connection
                addConnection(b,c)
            else: # query
                s = time.clock()
                queryConnection(b,c)
                t = time.clock() - s
                times.append(t)
        timetaken = time.clock()-ps
        print \'Using sets: Whole process took(secs): \', timetaken
    except IOError:
        exit(\'File open failed\')

## An efficient algorithm which outperforms union-find and which really terminates (!) is given in
## http://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf
## Implementing it below ...
class QuickUnionFind(object):        
    " class that implements quick union find with path compression "
    
    def __init__(self, filename):
        ""
        x = open(filename, \'r\')
        self.lines = x.readlines()
        x.close()

        self.id = None
        self.succ = 0 # success look ups
        self.fail = 0 # failure look ups

    def quick_union_find(self):
        ""
        noOfComps = int(self.lines.pop(0))
        self.id = np.arange(noOfComps+1, dtype=np.int32) # nodes marked from 1 to noOfComps

        for line in self.lines:
            a,b,c = line.split()

            b = int(b); c = int(c)

            if a == "c":
                self.add_connection(b,c)
            else:
                self.query_connection(b,c)

    def root(self, n):
        " returns the root of a node [root is the node with its id == 0 or itself] "
        x = n
        while x != self.id[x]:
            self.id[x] = self.id[self.id[x]] # path compression
            x = self.id[x]
        return x
                
    def add_connection(self, p, q):
        """
        -> follows quick union with path compression algorithm
        -> if root[p] == root[q] return
        """
        self.id[self.root(p)] = self.root(q)

    def query_connection(self, p, q):
        " if p and q have same root then they are connected "
        if self.root(p) == self.root(q):
            self.succ = self.succ + 1
        else:
            self.fail = self.fail + 1

def processFile2(filename):
    s = time.clock()
    quf = QuickUnionFind(filename)
    quf.quick_union_find()
    print \'Time taken by Quick Union Find(in secs): \', time.clock() - s
    print quf.succ, \',\', quf.fail

filename=\'c.in\'
readFile(filename)
processFile(filename)
print succ, \',\', fail
processFile1(filename)
print succ, \',\', fail
processFile2(filename)

Just a quick comparison of the three approaches used above:

For a network with 2000 nodes,

Time in (sec) to just read the file: 0.00177850113278
Using graphs: Whole process took(secs): 0.406984690094
119 , 358
Using sets: Whole process took(secs): 0.312078413127
119 , 358
Time taken by Quick Union Find(in secs): 0.114075089609
119 , 358

Balamurugan V Raman

Pages

Search This Blog

December 17, 2012

Net Connections Puzzle

No comments:

Post a Comment