Skip to main content

Towards incremental graph mining on streaming graphs

Resource type
Thesis type
(Thesis) M.Sc.
Date created
Analyzing the structural properties of graphs is important in various domains including bioinformatics, malware detection, and social network analysis. The fast-changing nature of real-world graphs demands efficient solutions to analyze them in real-time. While general-purpose graph mining systems have been developed to analyze static graphs, there is no comparable solution for mining insights from dynamic graphs. Existing application-specific streaming solutions store intermediate results and explore extraneous matches that are not relevant for the final results, degrading their overall performance. In this thesis, we present Protean, the first general-purpose graph mining system for streaming graphs. Protean incorporates two key components. First, a novel differential pattern matching engine that directly finds useful subgraphs resulting from graph updates and operates independently from the snapshot matching engine. And second, a dynamic processing model that captures cross-pattern dependencies for dynamic subgraph exploration, and utilizes an efficient multiversioning strategy to avoid exploration of automorphic matches. For applications where the patterns of interest are known apriori, the differential pattern matching engine constructs an efficient pattern exploration plan to find matches affected by the graph updates. On the other hand, applications that dynamically determine the patterns of interest often demand exploring the data graph from scratch in order to generate results for new patterns of interest. For such applications, Protean tracks cross-pattern dependencies using a Pattern-Dependency DAG (P-DAG for short) and dynamically invokes the right matching engine (differential or snapshot) based on the impact of graph update. As the matching tasks arising from the two matching engines demand different computing power, Protean parallelizes the matching tasks from the two engines in different ways in order to maximize performance. Our evaluation shows that Protean achieves low latency response to updates, often computing fresh results in less than a millisecond, which is crucial for continuous graph mining. As the existing purpose-built streaming graph mining solutions take minutes or even hours to process updates, Protean's update-driven parallel processing model enables orders of magnitude better performance, and scales to graphs with billions of edges.
37 pages.
Copyright statement
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Vora, Keval
Member of collection
Download file Size
etd21333.pdf 2.69 MB

Views & downloads - as of June 2023

Views: 0
Downloads: 0