
A Practical Implementation of Graphs

By Rielly H. Young, July 22nd, 2024

A common difficulty faced by organizations is the inability to quantify the relationship between abstract business objects. A makeshift solution is often the establishment of key performance indicators (KPIs) which, when aggregated, seek to provide quantifiable metrics related to a specific indicator. These KPIs can then be used as a proxy to compare various abstract objects against one another. But what if you were trying to compare a large number of products to one another based on qualities that cannot be easily quantified? What if your company provides a wide range of services across departments, and you wanted to determine which of these had the most overlap?

This was the problem facing my team when trying to quantify the similarity of programs offered by the organization. We had an array of over 400 programs, each with 30+ properties of various datatypes, that needed to be compared to one another. The goal of this comparison was to identify the programs with the highest similarity so that organizational resources could be directed more efficiently. Our solution was to build a graph of program objects for further analysis. The following case applies the same logic to a different set of abstract objects to demonstrate the usefulness of graphs.

Consider a problem where we need to compare the companies listed in the S&P 500 to find those that are the most 'alike'. There are many ways one could define 'alike': the products they sell or the services they provide, the number of employees they have, or their market capitalization. All of these methods would arguably work fine, but most would fail to yield any interesting patterns because, I would argue, they are too narrow in scope. Therein lies the problem: how can we quantify a relationship between two or more objects based on the widest variety of their shared properties? This is where the concept of a graph comes into play.

This is by no means an exhaustive explanation of graphs; rather, it is a primer for the purposes of understanding this example. At the most basic level, a graph is an abstract data structure made up of a set of nodes (also called vertices) and a set of edges, each of which represents a connection between two nodes. Edges can be directed or undirected (a graph of directed edges is sometimes referred to as a digraph); for our purposes we care about undirected edges, since the similarity between two objects is symmetric. Edges can also have values, which will be important to our implementation. Another important aspect of graphs is their means of traversal. While graph traversal can be accomplished with breadth-first search (BFS) or depth-first search (DFS), our implementation utilized the latter. The importance of this will be highlighted later.
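To make the idea of traversal concrete, here is a minimal, generic sketch of an iterative DFS over a graph stored as an adjacency list. It is for illustration only and is separate from the similarity graph built later in this article; the adjacency-list representation and the string labels are assumptions made purely for the example.

using System;
using System.Collections.Generic;

public static class TraversalExample
{
    // Visits every node reachable from 'start', going as deep as possible
    // along each branch before backtracking (depth-first).
    public static void DepthFirstSearch(Dictionary<string, List<string>> adjacency, string start)
    {
        var visited = new HashSet<string>();
        var stack = new Stack<string>();
        stack.Push(start);

        while (stack.Count > 0)
        {
            string current = stack.Pop();
            if (!visited.Add(current)) continue; // skip nodes we've already seen

            Console.WriteLine(current);
            if (adjacency.TryGetValue(current, out List<string> neighbors))
            {
                foreach (string neighbor in neighbors)
                {
                    if (!visited.Contains(neighbor)) stack.Push(neighbor);
                }
            }
        }
    }
}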

So, we now understand that we're going to build a graph where the nodes represent stocks in the S&P 500, which puts us half of the way there. What is missing? The edges. These represent the distance (or, in our case, similarity) between two nodes. To calculate this distance between two nodes, we sum a Levenshtein-based similarity score for each property and divide by the number of properties. The Levenshtein distance between two strings is the number of single-character edits required to turn one string into the other. Our implementation assumed two strings would have a distance ranging from zero (completely different) to one hundred (identical strings). This is the inverse of a typical implementation, but it will make sense as we move on.
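As a quick, hypothetical illustration of what a single per-property score looks like, the FuzzySharp package (also used in Appendix A) exposes Fuzz.Ratio, which returns an integer from 0 to 100 based on Levenshtein distance. The strings below are made-up values, not data from the actual comparison.

using System;
using FuzzySharp;

public static class ScoreExample
{
    public static void Main()
    {
        // 0 means completely different, 100 means identical.
        int score = Fuzz.Ratio("Information Technology", "Information Technology Services");
        Console.WriteLine(score); // a high score, since one string largely contains the other
    }
}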

Now we have everything we need to make the comparisons: we know what data structure we’re going to use, and we know how we’re going to quantify the edge distances. The rest is quite simple. Below is the C# implementation of a graph, with a few (not all) methods. Full definitions for Obj and NodePair can be found at the end of this document in Appendix A.

 
public class Graph 
{
   private HashSet<Obj> nodes { get; } = new();      // the objects being compared (stocks, in our case)
   private HashSet<NodePair> edges { get; } = new(); // the weighted connections between pairs of nodes
}

While a typical implementation of the graph data structure has more methods than are shown here, only AddNode and AddEdge are important to us. We start with a graph that has no nodes and iterate through our list of stocks, adding them one by one. If nodes is empty, we simply add the new node; if not, we add an edge to edges between the new node and every node already in nodes before adding the new node itself. Because our graph's edges are undirected, we create the minimum number of edges required: one per pair of nodes. The code would look something like this:

 
// Members of the Graph class shown above.
public void AddNode(Obj node)
{
    if (nodes.Count == 0)
    {
        // First node: there is nothing to connect it to yet.
        nodes.Add(node);
    }
    else
    {
        // Connect the new node to every existing node, then add it.
        foreach (Obj oldNode in nodes)
        {
            AddEdge(node, oldNode);
        }
        nodes.Add(node);
    }
}

private void AddEdge(Obj node1, Obj node2)
{
    // The NodePair constructor calculates the edge distance (see Appendix A).
    NodePair np = new(node1, node2);
    edges.Add(np);
}

The distance of an edge is calculated at the edge level (again, see Appendix A). Using reflection, we compare every property of the two objects (stocks in our case) and average the per-property similarity scores. Once the nodes have been added to the graph, determining which nodes are most alike is as simple as passing a threshold to the GetEdges(double threshold) method.

For example, if the threshold were set to 85, only edges with values of 85 or higher would be returned; in other words, only stock pairs that were at least 85% identical would be shown. The result is a list of the most similar stock pairs, sorted in descending order of similarity, that we can use for further analysis.

 
// Member of the Graph class shown above.
public List<NodePair> GetEdges(double threshold)
{
    // Keep only edges at or above the threshold, most similar first.
    return edges.Where(edge => edge.distance >= threshold)
                .OrderByDescending(e => e.distance)
                .ToList();
}
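Putting the pieces together, usage might look something like the sketch below. The LoadSp500Stocks helper is a placeholder of my own, standing in for whatever data source provides the Obj instances; the threshold of 85 matches the example above.

using System.Collections.Generic;

public static class Example
{
    public static void Main()
    {
        // Hypothetical helper that returns the S&P 500 companies as Obj instances.
        List<Obj> stocks = LoadSp500Stocks();

        var graph = new Graph();
        foreach (Obj stock in stocks)
        {
            // Each call creates an edge between the new stock and every stock already in the graph.
            graph.AddNode(stock);
        }

        // Stock pairs that are at least 85% identical, most similar first.
        List<NodePair> similarPairs = graph.GetEdges(85);
    }

    private static List<Obj> LoadSp500Stocks()
    {
        // Placeholder: load from a database, CSV file, API, etc.
        return new List<Obj>();
    }
}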

There are five hundred stocks in the S&P 500, so comparing each stock to every other one requires n(n-1)/2 pairwise distance calculations, roughly 125,000 for n = 500. A naive approach would repeat that work every time a new question was asked of the data. With the graph, each distance is calculated exactly once, when the edge is created; after that, answering a question such as "which pairs are most alike?" is a traversal problem, and a DFS (or BFS) over the graph runs in O(|N| + |E|) time, where N represents the set of nodes (stocks) and E the set of edges. From an efficiency perspective, building the graph once and traversing it with a DFS is by far the superior method.

Wrapping up, a graph with edge distances calculated from Levenshtein similarity is an easy and efficient means of quantifying the relationship between abstract objects. My initial implementation was used to determine the similarities between programs across our organization, and we were able to make connections between programs that had not previously been identified due to the vast quantity of data involved.

Appendix A.

 
using System;
using System.Reflection;
using FuzzySharp;

public class Obj
{
    // Example properties; the real objects had 30+ properties of various datatypes.
    public string prop1 { get; set; }
    public int prop2 { get; set; }
    public DateTime prop3 { get; set; }

    // remaining properties
}

public class NodePair
{
    public Obj node1 { get; set; }
    public Obj node2 { get; set; }
    public double distance { get; }

    public NodePair(Obj one, Obj two)
    {
        node1 = one;
        node2 = two;
        distance = SetEdgeDistance(one, two);
    }

    private static double SetEdgeDistance(Obj node1, Obj node2)
    {
        PropertyInfo[] properties = typeof(Obj).GetProperties();
        int numberOfProperties = properties.Length;
        double distance = 0;
        for (int i = 0; i < numberOfProperties; i++)
        {
            object val1 = properties[i].GetValue(node1);
            object val2 = properties[i].GetValue(node2);
            if (val1 == null && val2 == null)
            {
                distance += 100; // both properties missing: treated as identical
            }
            else if (val1 == null || val2 == null)
            {
                // only one property missing: treated as completely different, adds nothing
            }
            else
            {
                // GetRatio returns a 0-1 similarity ratio; scale it to the 0-100 range used above
                distance += Levenshtein.GetRatio(val1.ToString(), val2.ToString()) * 100;
            }
        }
        return distance / numberOfProperties; // average similarity across all properties
    }
}