Algorithms and data structures for graph analysis Introduction Software implementations for graph structures can represent a graph as an adjacency list or a sparse adjacency matrix. There are no standards regarding the implementation details of adjacency list or adjacency matrix structures. Notable parties propose different implementations for adjacency lists [1]. Guido van Rossum suggests using a hash table with arrays, where the arrays indicate edges to vertices. Cormen et al. [2] suggest using an array of singly linked lists. Goodrich and Tamassia propose dedicated objects for both vertices and edges, allowing for easily adding information to both vertices and edges. Each of the implementation suggestions has advantages and disadvantages. One can easily come up with more variations, but their trade-offs in the context of graph applications are not clear. Typically the choice of implementation will be dominated by memory/ time trade-offs and the need for dynamic graphs, i.e. the need to alter a graph at random. The size of available datasets is ever increasing. Although bigdata environments allow for massive parallelization, not every graph application can be parallelized; this puts these trade-offs more on the foreground. Behind these observations there is a central question that is relevant to the practice of graph theory: what is the price of using a general framework for creating graph applications? Is it best to think of designing a graph application in terms of a choice between one of the general frameworks (NetworkX , SNAP, Jung, etc.) Or should one see the design of a graph application more as a detailed choice of a set of applicable data structures? A quick answer is: ‘trust the experts’. The expert answer will of course always be: ‘it depends’. This question is relevant for dedicated applications where more performance means more competitive advantage. In this project the impact of data structures on the performance of graph applications is assessed. Both the impact on artificial graph generation and the impact on several graph algorithms will be assessed. These two steps provide a minimal workflow, applicable to many situations. More specific the impact of data structures will be assessed for the following tasks: Synthetic graph generation using a nearest neighbor algorithm [ref paper] Spectral clustering of a graph [3] Calculating maximum flows using the Ford-Fulkerson Document overview This document contains the following sections: A critique motivating the research subject A short overview of relevant data structures with some characteristics A rudimentary performance measurements of relevant data structures in the context of nearest neighbors A memory optimized data structures for graphs A comparison of three data structures in a graph context Performance measurements of synthetic graph generation using a nearest neighbor approach Performance measurements of a spectral clustering algorithm using different data structures Performance measurements of the Ford-Fulkerson algorithm using different data structures Critique A review of major graph implementations shows that implementation decisions regarding data structures are not made explicit. NetworkX [ref website], an elaborate Python based framework, does not mention implementation decisions regarding graph data structures; nor do SNAP [4] or Jung [5]. The primary reason for this is likely to be the fact that all these frameworks are positioned as general frameworks. The qualification of general in this context can be interpreted as ‘flexible’ or ‘broad, without specific aim’. The frameworks aim to offer the broadest support for graph applications, including dynamic graphs. The inclusion of dynamic graphs most likely precludes (or not?) the application of for example static arrays. A more flexible data structure for storing nodes like a maps or ordered maps seems more likely. But then the representation of edges remains open: linked lists, or dynamic lists, like Python lists or C++ vectors, or also a map or a set? In computer science the performance characteristics of different data structures are well known and stressed. Performance characteristics of specific algorithms like for example breadth first search or depth first search are also generally available. The performance trade-offs of data structures in the context of graph applications seem less considered, and therefore less known. The choice of data structure is of course impacted by graph algorithm under consideration; but then most graph applications are written in the context of general frameworks: this completes a circle. As noted before, the trade-offs between data structures become more relevant as the size of available graph datasets grows, and the interests in a competitive advantage are growing. Even if a graph algorithm itself is not parallelizable, one could for example tap into the power of map reduce frameworks to pre-sort datasets, off-loading parts of non-critical work; this is often done in information retrieval work flows. This project aims to assess the trade-offs between data structures for graph applications, with a bias towards applications for larger data sets. First an overview of relevant properties for graph data structures is given. Relevant properties for graph data structures The following general properties can be listed as relevant for graph data structures: Fast random access to nodes and edges Fast iteration over nodes and edges The ability to add and remove nodes and edges Memory efficiency Typically one would like to lookup items in constant O(1) or O(log n) time. Iteration allows for visiting nodes or edges in a single sequence without random access, for example in an iteration context. The ability to remove nodes and edges could be optional in some situations, but should generally be supported. Regarding memory efficiency three notes can be made: • Representing a node ('0') as the first element in a list and deleting that node would result in node '1' being represented as '0'; this is not favorable. One would require 'sparseness'. • Storing edges outside the context of • To efficiently remove nodes for an undirected graph both outgoing and incoming edges should be accessible fast (i.e. fast row and column access) With the properties above in mind three interesting scenarios emerge: Which design is best if optimizing for memory? Which design is best if optimizing for adding and removing nodes and edges? What are the speed trade-offs? These topics will be investigated in the next sections. First an overview of candidates for node and edges data structures will be given. A short overview of relevant data structures In this section the following six relevant data structures are reviewed: (Sparse) matrix Linked list Indexable skip list Dynamic array (vector) Ordered map (tree backed) Unordered map (hash table backed) Judy array This overview is not intended to be exhaustive, but should cover enough ground. At the end of this section an overview of some key properties of these data structures is given. The data structures will be shortly reviewed in the next sections. Full matrices provide a lot of good properties as a data structure for graphs: fast insertion, lookup and deletion. Traversing the (non-zero) elements of a row or column though is not so fast since all elements have to be visited. In practice the high memory requirements often prevents application of a full matrix. Sparse matrices circumvent the drawbacks of matrices, at a cost of course. Sparse matrices, like adjacency lists, do not have a fixed implementation. The requirements for sparse matrices mimic the requirements for graph data structures, as mentioned in the previous section (notably fast row and column access). Any graph data structure will in a way mimic a sparse matrix data structure; because of this the properties of sparse matrices will not be assessed to avoid a recursion of the questions at hand. Implementations of sparse matrices can be based on dictionaries of dictionaries, lists of list, coordinate lists (lists with (x, y) or (x, y, value) tuples). The last representation is popular in map/reduce contexts. A linked list is a chain of value and pointer tuples. Linked lists readily support fast iteration and insertion. Storing the values and the pointers requires extra memory. Since random access is not supported, lookups and deletions require traversal. For this reason ordering the items in the list for example does not help lookup. Skip lists try to fix these drawbacks. Skip lists add additional pointers to values farther down the chain. Lookups can skip large segments of items by ‘jumping’ over them. Adding multiple forward options to elements down the chain can mimic a binary search structure. By adding the number of skipped items per jump, additionally random access of elements by index is possible. This structure is only introduced because of its fit for one the experiments (graph generation). A dynamic array is range of continuous memory, containing elements of a specific type (integer, double, etc.). Contrary to a linked list, a dynamic array allow for index lookup, and allow for a binary search in ordered items. Key point is that deleting items requires moving large blocks of memory. An ordered set or map places elements in a binary tree structure. Lookup, insertion and deletion are reasonable. The tree structure requires a lot of memory Elements can be iterated over in order. A hash map stores elements in an array structure based on the hash value of the element. Collision can be resolved in multiple ways. The elements can not be accessed in order. Judy arrays apply tries to save memory. The design is very complex. I Imagine an n-ary tree where the intermediate (integer) values are woven into a trie. Data structure Matrix Linked list (unordered) Linked list (ordered) Indexable skip list Dyn. array (unordered) Dyn. array (ordered) Ordered map (tree) Unordered map (hash) Judy arrays Insert O(1) O(1) Delete O(1) O(n) Lookup O(1) O(n) Space O(n) O(n) sizeof varies 24 bytes O(n) O(log n) O(n) O(n) O(log n) O(n) O(log n) 24 bytes 32 bytes O(n) O(n) O(n) Exp. O(logn) O(n) O(n) O(log n) Exp. O(1) O(n) O(log n) O(log n) O(log n) O(n) O(n) 24 bytes 24 bytes Exp. O(1) Exp. O(1) O(n) 40 bytes 24 bytes In the next sections some rudimentary performance measurements of relevant data structures is presented. Rudimentary performance measurements of relevant data structures Rudimentary speed and memory characteristics of the previously mentioned data structures are assessed in this section. The following characteristics are assessed: • Memory consumption (figure 1) • Creation speed of data structure (figure 2) • Iteration speed over data structure (figure 3) • Lookup speed of random elements in data structure (figure 4) • Deletion speed of elements in data structures (figure 5) During measurements sets were used rather than maps to accentuate the overhead of index structures rather than extra dictionary values. This was not possible for Judy arrays. During the experiments the C++ wrapper of Judy arrays required lots of debugging. Creating a larger amount of Judy arrays required a lot of memory. For this reason Judy arrays were dropped. As an alternative to a dynamic array a deque was added to the measurements. A deque supports fast addition of elements to the front and back; this is made possible by not storing the elements in contiguous memory, but in contiguous blocks. This property does not apply in the context of this study, further considerations is drop. Sparse and dense maps and sets as implemented by Google were added to the measurements in a later stage. These data structures use bins, like hash maps and groups. Time was not available to investigate the structures in more detail (this is swapped for other activities). The ability to reserve memory upfront indicates that that some type of array structure is involved (probably involving groups/bin of length magic number 48). Lookups in dynamic arrays (and deques) are based on a binary search in ordered values, not on index access due to reasons of sparseness. Two compare the creation of ordered dynamic arrays using insertion sort and post sort, an additional measurement was performed (figure 6). Due to brevity reasons the results will not be discussed. Based on the results two candidate data structures will be presented in the next sections. Figure 1 Memory consumption Figure 2 Creation speed of data structures Figure 3 Iteration speed of data structures Figure 4 Lookup speed of data structures Figure 5 Deletion speed of 10000 random entries Figure 6 Creation of ordered dynamic arrays (dense set added for reference). Tooling For this study the following tools were used: • Ubuntu 14.04 • C++/gcc • Workstation Xeon 12 core, 2.7 Ghz, 24Gb RAM, 6Gbps disk A memory optimized candidate for data structures This section is dense as it involves the synthesis of a lot of information. For reasons of simplicity directed graphs are considered. First the memory optimized candidate is presented, second a short note on a read/write optimized candidate. Memory optimized candidate As the previous measurements show, dynamic arrays are efficient in storing data; as efficient as it gets. Some initial observations are: • In the memory measurements there seems to be some memory overhead, this is likely some fragmentation that is not immediately cleaned up; it will be ignored. • Node id-s can be assumed continuous (ignoring deletions) as it can be deferred to off-line processing. • Insertions into and deletions from vectors of moderate length (10.000 – 100.000) is still moderately possible (this range is what one might call 'Pythonistan') • • • • • • • • Using primitive types to store data in dynamic arrays prevents C++ object structure overhead If power laws are involved then using a separate vector to store the edges of a node might present unwanted overhead. The 'sparseness' argument requires at least storage of the node id-s itself apart from storing the edges. If storing very large graphs the allocation of very large contiguous memory might cause memory management issues. For the candidate deletion of nodes and edges is not considered optional, since this would limit the application space. Applying variable byte encoding like in search application is not favorable since indexed access is required The sparse hash map can store elements with an overhead of one bit Lookups in sorted vectors using binary search is relatively slow Based on the observations above, the following progression of memory optimized data structures can be made. A simple data structure would store the node id-s in an array, and store the edges of the nodes in separate arrays (one per node, the edges are sorted), accessible via a central array of pointers to those arrays. If creating a vector per node for the edges is to much overhead then the following progression can be made. The edges are not stored in separate arrays, but are stacked by order of node id in a single array. Since this could result in a huge array for the edges, a further progression can be made to help memory management. Based on the node id-s bins are made using the modulo operator. The node id-s are stored in sorted arrays, one per bin. These arrays are accessible via an array of pointers, which length is based on the modulo argument. The edges are stored sorted in a separate array per bin and stacked in order of the node id-s. A third array (per bin) stores the starting index of the nodes edges (i.e. the third node in the bin is '5', the edges of which start at index '45' of the edge array). Since nodes have to be looked up in the node arrays using binary search, further refinement is needed, as the measurements indicate. Instead of using a separate array for the node id's and the edge indexes, a single sparse hash map (SHM) can be used to store both the node id-s and the edge index at once. As the measurements show, this is fast and efficient. Additionally the edge array can be discarded. Since internally the nodes in the sparse hash map are not stored continuously, binning is not necessary for nodes. The edge array though can still be broken down in bins: this is likely to be the largest structure. Insertion of nodes would require appending to the sparse map; this should not be an issue. Insertion of edges would require updating the array with the edge indices, since the the edges are stacked per node. This is not efficient. But in one case though: • assuming more or less continuous node id-s • assuming a moderate range of edge degrees • setting the modulo argument to the number of nodes it is possible to land in 'Pythonistan' since every node has it own edge list. On appending a node: • the modulo argument has to be incremented • an extra edge edge array has to be created This is not an ideal situation, but might be manageable. Deletion requires a deletion from the node in the sparse map (fast as the measurements show) and a deletion from the edge arrays. Deletions from the edge array could be supported reasonably if the edge array is stored in in sparse hash map too and the reverse edges are stored for fast incoming edge lookups. This would probably need a 'landing edge' that is never deleted for edges in the edge array. Random edges lookup can not use binary search to lookup random edges if the edges are stored in a sparse map; this possible if each node has a separate sparse hash set: this requires more memory but might be an option for the speed optimized candidate. The resulting data structure allows optimization under several circumstances and supports a wide range of applications. In coming section the properties of this structure will be assessed. Interestingly the structure is some form of a map of maps. Speed optimized candidate Additionally the notion of speed is less uniform than the notion of memory. In light of the results in the previous section, the rest of this project the memory optimized candidate will be compared to SNAP instead, making adjustments if necessary. A comparison of data structures in a graph context In this section the memory optimized candidate (MOC) is benchmarked to SNAP for several graph related tasks. First a table of results is presented, followed by some comments. The measurements are intended to test the data structure requirements as stated earlier, and might reflect general best practices. Structure Task Dataset Results SNAP Read data set LiveJournal 48s MOC SNAP 6.3s Memory LiveJournal 1 bin 300MB MOC 2^14 bin 302MB SNAP Lookup 1M random nodes MOC Node arrays+bin. search MOC Nodes in SHM SNAP SNAP Power iteration 0.13 1854MB MOC reverse Comp. 0.16 +300 MB LiveJournal 0.11 84s 0.12 LiveJournal 1.1 20.5s Native using Lanczos 10.08 1 bin 13.6s 0.66 MOC SNAP Remove 1M nodes - - MOC 14^2 bins - - SNAP Create BFS tree MOC LiveJournal ** 6.7s 0.56 The datasets used were taken from SNAP [5]: • LiveJournal social network (4847571 nodes/68993773 edges) Since the MOC does not build up any binary tree structures (BTS) in memory for storing the nodes, reading datasets is very fast. Two preprocessing tasks are required though for this speed: • a list of unique node id-s with edge counts (a preprocessing task) • the input file consists of edges sorted per node (a preprocessing task) Note that the graph is directed and that no reverse edges are stored. In the MOC memory measurement the measured quantity was DRS memory. The power iteration shows that the MOC is fast enough iteration wise. Note that the SNAP power iteration was a simple one, created for this measurement. The SNAP power iteration iterated over the edges, the MOC power iteration iterated over the nodes first. This is a necessary step that does not seem to hurt. Native Lanczos in SNAP is faster. Removing edges is not implemented due to time contstraints. The creation of the BFS tree can be done fast in MOC by avoiding the usage of set like structures. First an appropriate empty node and edge structure is copied were the node id-s are marked using a special marker (MAX_INT for example). With the sparse representation this is not too bad, it requires about as much memory as a set containing a 95% SCC (as can be concluded from the measurements, and there might be some logic behind this), but it is considerably faster. Next during the BFS each marker is replaced by the number of valid outgoing edges, if the node is visited. If a node has been visited, the edge is marked (MAX_INT for example), indicating is has been surpassed. Finally the node structure is compacted. Intermediary notice Up to this point the project has been updated after the milestone. The next results have not changed since the milestone, and are presented as is; the conclusion exempt. Performance measurements of synthetic graph generation In this section the performance characteristics of synthetic graph generation are assessed. To generate the data a nearest neighbor approach as described by [ref] is used. A critical part of the generation process involves randomly choosing a potential connection from a large set of potential connections. Two strategies are explored: storing all potential edges in one set (‘global’), and storing potential edges as ‘local’ sets connected to the ‘central’ node of the potential connections. In light of the previous measurements, nodes are stored as elements of vectors since during graph creation nodes are ‘append’ only. For storing the potential connections random access skip lists are used, since this structure seems to fit the requirements best. This might come with a hefty memory penalty. Note that although maps allow for faster lookups and deletions, these structures do not allow for indexed lookup, which is favorable for random draws. In the simulations u=0.5 and k=1. Outside the moderate range, memory/speed trade offs are unavoidable, the choice of data structures have a big impact on the performance of the graph generation algorithm. The trade-off to be made would likely be between random access skip lists for speed and vectors for memory efficiency. But this only relevant if the number of potential edges exceeds the moderate range; for a lot of cases using for example vectors to store potential connections can suffice. Using more specific knowledge of u one might be able to optimize memory requirements by using a local strategy and storing either the unconverted potential connections or the complement. Performance measurements of spectral clustering One can apply spectral clustering using the power method to find the eigenvectors of a Laplacian matrix representation of a graph. Let it first be noted that map reduce is specifically fit to perform power iterations over large graphs. This fact greatly diminishes the applicability of performance assessments in this section. In this assessment the NJW [3] method will applied, using the largest eigenvector. After the eigenvectors are found, the nodes are clustered using a k-means algorithm initialized with kmeans++. The nodes are stored as vectors. For the iteration of edge values one must then store both a column index and the value itself. Four implementation options seem valid for storing the edges: A linked list of column-value pairs Two parallel linked lists with column and values A dynamic array of column-value pairs Two parallel dynamic arrays with column and values A very small traversing experiment shows that the last structure is the dominant choice. i.e. both speed and memory wise. Since maps also allow for iterator access, the performance of maps is considered equal to the performance of a linked list with objects. This should ne comparable to the single linked list implementation; the performance of which is comparable to the data structure of choice. Hence, a maps-ofmaps generic graph structure is likely to yield similar performance (requiring more memory of course as shown earlier). Crunching two parallel dynamic arrays set-up a bit more, the final data structure for the assessment contains just three lists: One list with the cumulative number of edges over the nodes One list with block wise the edges per node (addressable using the first list) One list with block wise the values per edge in the same order as the second list. A power iteration now involves the iteration over three consecutive memory blocks. This sequential pattern shows that the computation can be performed parallel, using for example map/ reduce. A similar graph generation set-up is used as in the previous section. Performance measurements Ford-Fulkerson In this section the impact of data structures on the performance of the Ford-Fulkerson maximum flows algorithm is assessed. Unlike the algorithm in the previous section, this algorithm is not as easily parallelized; which provides an opportunity for improvements. Two different implementations of the algorithm will be compared. The first version uses as data structure a map of maps to represent a weighted graph, the second version uses the compact data structure with three vectors as mentioned in the previous graph. The algorithm requires the random access on the edges of a node. As shown at the beginning, one would expect maps to be faster. Implementatio n Maps of maps Three vectors Iterations Nodes ms 10000 10000 6 6 760 70 A non-representative micro experiment shows that solving a small 6 nodes flow network 10000 times is actually done faster by the graph with three vectors. Clearly the seek time to assess the edges weights is not dominating the time taken yet. The probability that this result is real is very low. Even if proven valid, generalizing the results should be done with utter care. Conclusion In the introduction it was stated that textbook implementations leave ground uncovered. In this project it was shown that newer data structures (notably from Google) offer different possibilities, that allow for the design of new solutions. These designs are relevant in contexts were performance delivers an competitive advantage. It was shown that dynamic arrays do a decent job in the moderate range; dynamic arrays must have been very well optimized on all levels. The point above explains much of the success of Python and its list structures. It seems that there new opportunities in ‘extremistan’ trade-offs are hard and shall be made consciously. C++ turned out a good type of environment for this kind of assessment; though implementing software in C++ is not trivial. The information presented in this context is not new; accessing this information requires extensive search, and evaluation. Project statistics C++ code: 1400-1600 lines Python POC code: similar to milestone (600-700?) Gnuplot code: 200-300 lines A short note on a bumpy road These project results are at the end of a bumpy road in many ways. Implementing software in C++ proved hard; harder than in Python for example. C++ might not have been the right choice productivity wise. The language does provide opportunities, as was hopefully shown. Looking back at the midterm expectations I can conclude the following. Implementing the software so far proved harder than expected. But finding new technical opportunities and molding these into a design is hard, especially of the beaten track. I had a hunch of this, but not strong enough. I underestimated the time involved for the extra requirements as presented by the TA. These requirements fall more into the space of graph analysis itself. I can see the reasons for the requirements, and would have liked meeting those. This should in my opinion not diminish the value of the hard work put in and the results: this is the central question though. This is my seventh course at Stanford. Compared to other courses, I put in as much time for this course as for the other courses. To make progress with the project (and for personal reasons), I took more than four weeks leave, and even ended up terminating my contract. As stated, a bumpy road. Time ended up being available by setting priorities. From my part I am happy to have taken the challenge. Compared to the other courses, in this course I reached far to get into more serious space regarding bigger data crunching. References [1] “Adjacency list” Wikipedia. Wikimedia Foundation, accessed 10 december 2014. [2] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms(2nd ed.). McGraw-Hill Higher Education, (2002). [3] Ng, A., Jordan, M., Weiss, Y.. On spectral clustering: analysis and an algorithm. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14, pp. 849- 856. MIT Press, Cambridge (2002). [4] Leskovec, J. Stanford analysis project. Stanford University, accessed 10 december 2014. [5] “Java Universal Network/Graph Framework”. Sourceforge.net, accessed 10 december 2014.

1/--pages