# Scenarios

**Notional Scenarios for the 2017 HIVE Graph Challenge**

In this era of big data, the rates at which these data sets grow continue to accelerate. The ability to manage and analyze the largest data sets is always severely taxed. The most challenging of these data sets are those containing relational or network data. The HIVE challenge is envisioned to be an annual challenge that will advance the state of the art in graph analytics on extremely large data sets. The primary focus of the challenges will be on the expansion and acceleration of graph analytic algorithms through improvements to algorithms and their implementations, and especially importantly, through special purpose hardware such as distributed and grid computers, and GPUs. Potential approaches to accelerate graph analytic algorithms include such methods as massively parallel computation, improvements to memory utilization, more efficient communications, and optimized data processing units.

The 2017 HIVE challenge is composed of two challenges: the first focuses on subgraph isomorphism and the second on community detection. The baseline algorithms for the first challenge are recently developed algorithms that find triangles and k-trusses (J. Wang 2012). The triangle counting algorithms can be considered as a special case of subgraph isomorphism where the subgraph of interest is restricted to a triangle. Although these algorithms do not find matching subgraphs of a general description, they can be used as components in algorithms that do. K-truss search algorithms can potentially support subgraph isomorphism algorithms through the characterization of a larger graph and a subgraph of interest. Inconsistent k-truss features prove that an isomorphism does not exist between two subgraphs while consistent k-truss features indicate a region in the larger graph that requires additional examination to determine if an isomorphism exists. The goal for this first challenge will be to improve the efficiency of the triangle counting and k-truss algorithms while researchers develop new subgraph isomorphism algorithms that can leverage triangle counting and k-truss algorithm results.

Triangle counting algorithms that challengers may wish to examine include the mini-tri algorithm, developed by Wolf et al. (Wolf 2015), algorithms by Burkhart, et al. (Burkhardt 2016) and Azad, et al. (Azad 2015), and a MapReduce implementation by Cohen et al. (Cohen 2009). A comparison of triangle counting algorithms is available in (L. Wang 2016.). A k-truss detection algorithm is provided to identify k-trusses (Cohen 2008). Benchmark algorithms for triangle counting and k-truss detection are included in the challenge data package. Competitors may use these algorithms or other algorithms of their choice and accelerate them through improvements to hardware or firmware systems of their choice. Performance will be assessed based upon problem characteristics and performance measures. The problem characteristics and measures include the total number of vertices and edges in the graph, the number and types of processors used, the execution time, the total energy consumption in Watts, and the memory required for the computation. Secondary measures estimated from the primary measures are the processing rate in edges traversed per second and the processing rate per unit energy (edges/sec/Watt). The number of triangles and k-trusses should be reported for each run. The expectation is that any correctly implemented algorithm will determine the correct number because the provided baseline algorithms are deterministic.

The second challenge is to identify communities in large graphs using the distribution of edges in the graphs as features. The standard definition of communities is applied here, in that communities have greater numbers of edges between the members of the community than they do to other vertices in the graph. This challenge is configured to be a streaming graph challenge and therefore the algorithms are expected to have a real-time aspect to their implementation. Streaming generally means that data flows through a communications channel to the algorithm. The algorithm then simultaneously ingests the real-time data, performs analytics, and reports results as they are obtained. A variant of streaming data is used for this challenge, where data are provided in blocks that the algorithm can process and produce results. The algorithm will then request the next block to repeat the cycle until all data are processed. A baseline graph partitioning algorithm, originally developed by T. P. Peixoto is provided (Peixoto 2012) (Peixoto. 2013) (Peixoto 2014). This algorithm uses a stochastic blockmodel, Markov Chain Monte Carlo techniques, and parallel computing to identify communities. Competitors are free to use other algorithms and other hardware to improve performance. Because of the stochastic nature of this challenge, competitors will be evaluated on their accuracy as measured through a number of statistical and information theoretic measures in addition to the computational measures for the triangle and k-truss counting challenge. Performance measures should be collected at all stages in the streaming process. Measures of the number of vertices and edges should be recorded, was well as the number and types of processors. The execution time at each stage, as well as the total energy consumption in Watts and the memory requirements, should be recorded. The challengers should construct a contingency table from the provided truth data and the results of their calculations. Measures generated from the contingency table include overall accuracy, block-wise precision and recall, pairwise counting metrics, including the Rand index, adjusted Rand index, and pairwise precision and recall. The information theoretic measures include information precision and information recall, defined as the mutual information between the truth and the reported results divided by either the entropy of the reported results or the entropy of the truth.

The 2017 challenge algorithms do not directly address many of the graph analytics needs of government agencies, but are foundational algorithms that may be components in more complex algorithms that can address the need to rapidly analyze extremely large data sets. The challenge algorithms for this year are anticipated to be useful for a fast initial assessment of the structure of large networks and detection of community structure within the networks. For example, the two algorithms can be used to quickly filter out uninteresting subgraphs and other structures before performing multivariate data analysis on the attributes of interesting subgraphs with slower, more complicated algorithms, thus reducing the need to perform detailed analyses on large graphs.

A number of possible scenarios can be imagined where improved graph analytics arising from the two HIVE graph challenge problems could be valuable. Community detection is a common class of network analytics and is used in a broad range of applications. Specific examples are terrorist and criminal network detection, insider threat detection, humanitarian assistance, and discovering and monitoring bad actors within the dark web. For 2017, the following notional scenarios are provided to give a sense of how the challenge algorithms may support more complex analyses of potential utility to corporations, government agencies and international organizations. Note that the 2017 challenge data sets ** do not** contain data specific to the described notional scenarios. In future challenges, the organizers plan to develop scenarios that will be reflected in the generated data sets.

*Note: The following scenarios and all people, places, groups, technologies, and events contained therein are fictional. Any resemblance to real people, places, groups, technologies, or events is purely coincidental.*

## Cyber Scenario:

The International Consortium of Internet Service Providers (ICISP) expects that the rapid growth of the Internet of Things (IoT) will radically alter the structure of the internet, which is estimated to have already added six billion nodes and growing daily. There is concern that world-wide deployment of these products will reduce the security of the internet, potentially exposing the consortium members to litigation for failure to protect customers. Without action, it is predicted that cyber criminals and unscrupulous corporations will exploit the new structure and scale of the Internet of Things for nefarious purposes, such as botnet attacks to influence markets and to limit competitors’ access to customers. ICISP is in the process of implementing a global internet traffic monitoring system. It requires new algorithms for the timely assessment of internet community structure so that analysts can identify clusters that are likely to have been enlisted into botnets or are currently participating in cyber-attacks and influence operations. You are to construct a system that can quickly filter large graph datasets and generate output for follow-on analyses by other algorithms that will examine sub-network attributes to detect potential cyber-crime activity.

**Challenge 1:** The objective of this challenge is to perform a fast characterization of the devices on the internet to support the identification of potentially susceptible devices and critically important sub-networks and to detect characteristic botnet connectivity patterns. Triangle detection and k-truss algorithms are able to run in polynomial time and thus have an edge over a number of other algorithms for quick characterization of the connectivity patterns of nodes on the internet. Hardware and firmware acceleration may also be required to provide characterization information in a timely manner. Challengers are welcome to consider other algorithms that can provide an early characterization of devices on the internet.

**Challenge 2:** Because the internet of things continues to grow at a rapid pace, a streaming analysis of the partitions in the network would be valuable to reduce the response times of the ICISP cyber security teams. The teams will be able to begin more detailed analyses as the partitions are identified instead of waiting for full data ingest and processing to be completed. In this challenge, a stochastic membership blockmodel is used to process streaming node connectivity data and assign nodes to clusters. Current versions support parallel processing for assigning nodes to clusters, but additional software, firmware, and hardware optimization may be needed to make clustering practicable.

## Human Trafficking Scenario:

The Transnational Constabulary (TC) has instituted a program to combat international human trafficking. The program is using social relationship maps to help identify individuals involved in human trafficking. Investigators have observed that rapid onset natural disasters and complex sectarian conflicts create opportune conditions for trafficking. The current analysis systems cannot keep up with the rate of change in the social networks during these events. The TC would like your team to develop algorithms that can quickly identify social groups in large networks. Output from your algorithm will be used to filter the groups to a smaller set that investigators will examine with slower and more complex analytic tools to detect patterns of human trafficking. Processing time is critical because investigators will want to act against trafficking networks in the post emergency and immediate restoration periods before they become entrenched in a region.

**Challenge 1:** The objective of this challenge is to perform a fast characterization of the members of large social networks using network connectivity data from before and after a disruptive event. The TC will use the two connectivity patterns to identify changes that warrant further investigation to determine if a human trafficking network is forming and to compare the patterns to known human trafficking networks. Triangle detection and k-truss algorithms are able to run in polynomial time and thus have an edge over a number of other algorithms for generating connectivity pattern information for members in a social network. Hardware and firmware acceleration may be required to provide characterization information in a timely manner.

**Challenge 2:** Because social networks evolve in response to natural disasters and sectarian conflicts, a streaming analysis of the communities (partitions) in the social network would be valuable. Early detection of the appearance of a human trafficking network before the full network is received and processed can reduce the delay between the expansion of human trafficking in the region and initial TC operations against their networks. In this challenge, a stochastic membership blockmodel is used to assign individuals to groups or partitions. Current versions support parallel processing for assigning nodes to clusters, but additional software, firmware, and hardware optimization is needed to make the clustering practicable.

## Microbiome Scenario:

Network descriptions of biological systems have become increasingly valuable for understanding that the normal and pathological processes of these large systems. The Public Health Ministry has issued a challenge to industry to develop fast diagnostics for detecting pathologies within microbiomes. This is expected to be especially useful for identifying the initial signals for epidemics through shifts in the prevalence of different organisms in microbiome samples. The expected benefit from the program is that currently undetectable exposures to pathological relationships in individuals’ biome will be able to be detected and long-term negative health consequences avoided. The advancement of technology for rapid sequencing and generation of the biome network is already being pursued by other organizations. Your organization has been requested to improve a system that identifies the component organisms in a biome by rapid detection and “fingerprinting” of clusters in the network. Fast processing is required because predictive analysis of historical public health data estimates that hundreds of thousands of biome network samples may need to be processed within days so that appropriate actions can be taken to stay ahead of any developing health crisis.

**Combined challenge:** This challenge is to combine two algorithms to optimize processing: The first algorithm is based upon a stochastic blockmodel that use streaming analytics to rapidly detect partitions in the biome network, and the second algorithm uses triangle detection and k-truss algorithms to generate “fingerprints” or “hash keys” of the partitions for an initial identification of the different communities in the biome through reference to a comprehensive database of biome “fingerprints.” Fast processing is demanded from the combined set of algorithms. Challengers are to identify the appropriate order for the two algorithms and optimize processing time and accuracy. Challengers may elect to explore system tradeoffs that can improve timeliness and accuracy, including the selection of better and faster algorithms, or the optimization and/or adoption of different hardware, parallel processing approaches, or special purpose computational units. Challengers will be rewarded for novel approaches that result in significant performance improvements over a baseline system.

# Bibliography

Azad, Ariful, Aydin Buluç, and John Gilbert. "Parallel triangle counting and enumeration using matrix algebra." *Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop.* Washington, DC, USA: IEEE Computer Society, 2015. 804–811.

Burkhardt, Paul. "Graphing trillions of triangles." *Information Visualization, 0(0):1473871616666393,*, 2016.

Cohen, Jonathan. "Graph Twiddling in a MapReduce World." *Computing in Science and Engg, 11(4).*, July 2009: 29–41.

Cohen, Jonathan. *Trusses: Cohesive subgraphs for social network.* National Security Agency Technical Report, 2008.

Peixoto, Tiago P. "Efficient monte carlo and greedy heuristic." *Physical Rev. E, 89(1):012804*, 2014.

Peixoto, Tiago P. "Entropy of stochastic blockmodel ensembles." *Physical Rev. E, 85(5):056122*, 2012.

Peixoto., Tiago P. "Parsimonious module inference in large networks." *Physical Rev. Letters, 110(14):148701*, 2013.

Wang, Jia and James Cheng. "Truss decomposition in massive networks." *Proc. VLDB Endow. 5(9).* 2012. 812-823.

Wang, Leyuan, Yangzihao Wang, Carl Yang, and John D. Owens. "A comparative study on exact triangle counting algorithms on the gpu." *Proceedings of the ACM Workshop on High Performance Graph Processing.* New York, NY, USA: ACM, 2016. 1–8.

Wolf, M. M., J. W. Berry & D. T. Stark. "A task-based linear algebra building blocks approach for scalable graph analytics." *High Performance Extreme Computing Conference (HPEC), 2015 IEEE.* Waltham, MA: IEEE, 2015. 1-6.