Data Sets

Amazon is making the Graph Challenge data sets available to the community free of charge as part of the AWS Public Data Sets program. The data is being presented in several file formats, and there are a variety of ways to access it.

Data is available in the 'graphchallenge' Amazon S3 bucket (https://graphchallenge.s3.amazonaws.com).

Synthetic Sparse Deep Neural Network data for the Sparse DNN Graph Challenge

Official 2019 Sparse Deep Neural Network Challenge: synthetic DNNs created using RadiX-Net with varying numbers of neurons and layers. Truth categories for MNIST are included for performing inference using DNNs with specific numbers of layers.
Name Description / Download Link
1024 Neurons Sparse Deep Neural Networks with 1024 neurons per layer (small, 176 MB) Download
  120 layers - Categories 480 layers - Categories 1920 layers - Categories
4096 Neurons Sparse Deep Neural Networks with 4096 neurons per layer (medium, 800 MB) Download
  120 layers - Categories 480 layers - Categories 1920 layers - Categories
16384 Neurons Sparse Deep Neural Networks with 16384 neurons per layer (large, 3.6 GB) Download
  120 layers - Categories 480 layers - Categories 1920 layers - Categories
65536 Neurons Sparse Deep Neural Networks with 65536 neurons per layer (very large, 16.3 GB) Download
  120 layers - Categories 480 layers - Categories 1920 layers - Categories
Sparse DNNs generated using interpolated sparse versions of images in MNIST corpus resized to produce neural networks of varying dimensions.
Name Description / Download Link
MNIST-derived Networks
32x32 (1024 neurons) 64x64 (4096 neurons) 128x128 (16384 neurons) 256x256 (65536 neurons)

 

Real and Synthetic Data for the Static Graph Challenge: Subgraph Isomorphism

Real-world graphs from Stanford’s Large Network Dataset Collection (https://snap.stanford.edu/data/) as well as synthetic data at various scales generated using the scalable Graph500 Kronecker generator (http://www.graph500.org/specifications#sec-3_3) are being provided.

Each of the SNAP datasets is provided in both TSV (Tab-Separated Values) and MMIO (Matrix Market I/O) formats. You can access any desired file directly by constructing an HTTPS or AWS CLI URL from the URL suffixes and instructions below.

A CSV file with metadata about the SNAP datasets below is available here: SNAP Metadata

The metadata includes the number of edges, nodes, and triangles for each dataset.

SNAP Datasets
Name Description
amazon0302 Amazon product co-purchasing network from March 2 2003
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
amazon0312 Amazon product co-purchasing network from March 12 2003
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
amazon0505 Amazon product co-purchasing network from May 5 2003
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
amazon0601 Amazon product co-purchasing network from June 1 2003
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
as20000102 Autonomous Systems graph from January 02 2000
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
as-caida20071105 CAIDA AS graph from November 5 2007
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
ca-AstroPh Collaboration network of Arxiv Astro Physics
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
ca-CondMat Collaboration network of Arxiv Condensed Matter
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
ca-GrQc Collaboration network of Arxiv General Relativity
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
ca-HepPh Collaboration network of Arxiv High Energy Physics
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
ca-HepTh Collaboration network of Arxiv High Energy Physics Theory
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
cit-HepPh Arxiv High Energy Physics paper citation network
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
cit-HepTh Arxiv High Energy Physics Theory paper citation network
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
cit-Patents Citation network among US Patents
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
email-Enron Email communication network from Enron
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
email-EuAll Email network from a EU research institution
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
facebook_combined Edges from all Facebook ego networks combined
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
flickrEdges Image relationships on Flickr (edges only)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
Friendster Friendster social network graph
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
loc-brightkite_edges Brightkite location based online social network
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
loc-gowalla_edges Gowalla location based online social network
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010331 AS peering information inferred from Oregon route-views from March 31 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010407 AS peering information inferred from Oregon route-views from April 7 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010414 AS peering information inferred from Oregon route-views from April 14 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010421 AS peering information inferred from Oregon route-views from April 21 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010428 AS peering information inferred from Oregon route-views from April 28 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010505 AS peering information inferred from Oregon route-views from May 05 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010512 AS peering information inferred from Oregon route-views from May 12 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010519 AS peering information inferred from Oregon route-views from May 19 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon1_010526 AS peering information inferred from Oregon route-views from May 26 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010331 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from March 31 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010407 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from April 7 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010414 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from April 14 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010421 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from April 21 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010428 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from April 28 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010505 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from May 05 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010512 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from May 12 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010519 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from May 19 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
oregon2_010526 AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from May 26 2001
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella04 Gnutella peer to peer network from August 4 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella05 Gnutella peer to peer network from August 5 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella06 Gnutella peer to peer network from August 6 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella08 Gnutella peer to peer network from August 8 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella09 Gnutella peer to peer network from August 9 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella24 Gnutella peer to peer network from August 24 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella25 Gnutella peer to peer network from August 25 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella30 Gnutella peer to peer network from August 30 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
p2p-Gnutella31 Gnutella peer to peer network from August 31 2002
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
roadNet-CA Road network of California
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
roadNet-PA Road network of Pennsylvania
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
roadNet-TX Road network of Texas
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
soc-Epinions1 Who-trusts-whom network of Epinions.com
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
soc-Slashdot0811 Slashdot social network from November 2008
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
soc-Slashdot0902 Slashdot social network from February 2009
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO

Source for the synthetic Kronecker graph generator is available: Kronecker Graphs

Metadata table for the exact Kronecker graphs is available: Kronecker Graphs (2018 March 16)

Exact Kronecker Graph generator paper: Design, Generation, and Validation of Extreme Scale Power-Law Graphs

Synthetic Kronecker Graphs with many triangles:

Theory-16-25-81-B1k.tsv Theory-16-25-B1k.tsv Theory-256-625-B1k.tsv Theory-25-81-256-B1k.tsv Theory-25-81-B1k.tsv Theory-3-4-5-9-16-25-B1k.tsv Theory-3-4-5-9-16-B1k.tsv
Theory-3-4-5-9-B1k.tsv Theory-3-4-5-B1k.tsv Theory-3-4-B1k.tsv Theory-4-5-9-16-25-B1k.tsv Theory-4-5-9-16-B1k.tsv Theory-4-5-9-B1k.tsv Theory-4-5-B1k.tsv
Theory-5-9-16-25-81-B1k.tsv Theory-5-9-16-25-B1k.tsv Theory-5-9-16-B1k.tsv Theory-5-9-B1k.tsv Theory-81-256-B1k.tsv Theory-9-16-25-81-B1k.tsv Theory-9-16-25-B1k.tsv
Theory-9-16-B1k.tsv

Synthetic Kronecker Graphs with some triangles:

Theory-16-25-81-B2k.tsv Theory-16-25-B2k.tsv Theory-256-625-B2k.tsv Theory-25-81-256-B2k.tsv Theory-25-81-B2k.tsv Theory-3-4-5-9-16-25-B2k.tsv Theory-3-4-5-9-16-B2k.tsv
Theory-3-4-5-9-B2k.tsv Theory-3-4-5-B2k.tsv Theory-3-4-B2k.tsv Theory-4-5-9-16-25-B2k.tsv Theory-4-5-9-16-B2k.tsv Theory-4-5-9-B2k.tsv Theory-4-5-B2k.tsv
Theory-5-9-16-25-81-B2k.tsv Theory-5-9-16-25-B2k.tsv Theory-5-9-16-B2k.tsv Theory-5-9-B2k.tsv Theory-81-256-B2k.tsv Theory-9-16-25-81-B2k.tsv Theory-9-16-25-B2k.tsv
Theory-9-16-B2k.tsv

Protein k-mer graphs generated using data from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) are available below. Nodes of the graph represent segments of amino acids.

Protein k-mer graphs
Name Description
Graph 1 Num. vertices : 170728175 Edge count : 360585172
  Adjacency TSV
Graph 2 Num. vertices : 139353211 Edge count : 297829984
  Adjacency TSV
Graph 3 Num. vertices : 67716231 Edge count : 138778562
  Adjacency TSV
Graph 4 Num. vertices : 214005017 Edge count : 465410904
  Adjacency TSV
Graph 5 Num. vertices : 55042369 Edge count : 117217600
  Adjacency TSV

MAWI Working Group Traffic Archive (http://mawi.wide.ad.jp/mawi/): The MAWI (Measurement and Analysis on the WIDE Internet) Working Group has carried out network traffic measurement, analysis, evaluation, and verification since the beginning of the WIDE Project. The graphs provided here were generated from packet trace data from the WIDE backbone maintained by the MAWI Working Group.

MAWI Datasets
Name Description
Graph 1 Num. vertices : 18571154, Edge count : 38040320
  Adjacency TSV      
Graph 2 Num. vertices : 35991342, Edge count : 74485420
  Adjacency TSV      
Graph 3 Num. vertices : 68863315, Edge count : 143414960
  Adjacency TSV      
Graph 4 Num. vertices : 128568730, Edge count : 270234840
  Adjacency TSV      
Graph 5 Num. vertices : 226196185, Edge count : 480047894
  Adjacency TSV      
Synthetic Datasets
Name Description
graph500-scale18-ef16 Synthetic graph500 network of scale 18 (262144x262144, 4194304 edges)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
graph500-scale19-ef16 Synthetic graph500 network of scale 19 (524288x524288, 8388608 edges)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
graph500-scale20-ef16 Synthetic graph500 network of scale 20 (1048576x1048576, 16777216 edges)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
graph500-scale21-ef16 Synthetic graph500 network of scale 21 (2097152x2097152, 33554432 edges)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
graph500-scale22-ef16 Synthetic graph500 network of scale 22 (4194304x4194304, 67108864 edges)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
graph500-scale23-ef16 Synthetic graph500 network of scale 23 (8388608x8388608, 134217728 edges)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
graph500-scale24-ef16 Synthetic graph500 network of scale 24 (16777216x16777216, 268435456 edges)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
graph500-scale25-ef16 Synthetic graph500 network of scale 25 (33554432x33554432, 536870912 edges)
  Adjacency TSV Incidence TSV Adjacency MMIO Incidence MMIO
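
The sizes listed for the Graph500 datasets follow a simple pattern: a graph of scale s has 2^s vertices and 16 times as many edges. The short Python sketch below reproduces the numbers in the table. The function name `graph500_size` is illustrative, and reading the `ef16` suffix as an edge factor of 16 is an assumption that happens to match the listed counts.

```python
def graph500_size(scale, edge_factor=16):
    """Return (num_vertices, num_edges) for a Graph500 graph of the given scale."""
    num_vertices = 2 ** scale
    num_edges = edge_factor * num_vertices
    return num_vertices, num_edges


# Print one line per dataset, in the same shape as the table entries.
for scale in range(18, 26):
    n, m = graph500_size(scale)
    print(f"graph500-scale{scale}-ef16: {n}x{n}, {m} edges")
```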

 

Synthetic Data for the Streaming Graph Challenge: Stochastic Block Partition

Provided below is a set of synthetic datasets generated as MxM images, where M = 2^n for n = 8, 9, 10, 11, 12, 13. Each pixel in the image is treated as a node in the graph and is connected to its 8 neighbors by undirected edges. Pixels on the boundary have fewer neighbors: 3 at the corners and 5 along the edges.
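
As a concrete illustration of this construction, the following Python sketch enumerates the undirected edges of an M x M pixel grid with 8-neighbor connectivity and counts the neighbors of each pixel. The function names and the 0-based row-major node numbering are assumptions made for this example; this is not the generator that produced the datasets.

```python
def grid_edges(m):
    """Yield undirected edges (u, v), with u < v, of an m x m 8-neighbor pixel grid.

    Pixels are numbered in row-major order starting at 0 (an assumption).
    """
    # Four "forward" offsets cover each of the 8 neighbor relations exactly once.
    offsets = [(0, 1), (1, -1), (1, 0), (1, 1)]
    for r in range(m):
        for c in range(m):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < m and 0 <= cc < m:
                    yield (r * m + c, rr * m + cc)


def degrees(m):
    """Return a list with the number of neighbors of every pixel."""
    deg = [0] * (m * m)
    for u, v in grid_edges(m):
        deg[u] += 1
        deg[v] += 1
    return deg


deg = degrees(4)
# Neighbor counts for a corner pixel, a non-corner boundary pixel, and an interior pixel.
print(deg[0], deg[1], deg[5])
```

Counting neighbors this way gives 3 for a corner pixel, 5 for any other boundary pixel, and 8 for an interior pixel.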

Provided below are a set of synthetic datasets with known truth partitions for use in the Stochastic Block Partitioning Graph Challenge.

2017 Streaming Partition Challenge Datasets with Known Truth Partitions
Name Description
Static Graphs (small) Small Static Graphs with known truth for the Stochastic Block Partitioning Challenge
  50 nodes 100 nodes 500 nodes 1000 nodes 5000 nodes
Static Graphs (large) Large Static Graphs with known truth for the Stochastic Block Partitioning Challenge
  20000 nodes 50000 nodes 500000 nodes 2000000 nodes 5000000 nodes
Streaming - Edge Sampling (small) Small Streaming Graphs with known truth for the Stochastic Block Partitioning Challenge
  500 nodes 1000 nodes 5000 nodes
Streaming - Edge Sampling (large) Large Streaming Graphs with known truth for the Stochastic Block Partitioning Challenge
  20000 nodes 50000 nodes 500000 nodes 2000000 nodes 5000000 nodes
Streaming - Snowball Sampling (small) Small Streaming Graphs with known truth for the Stochastic Block Partitioning Challenge
  500 nodes 1000 nodes 5000 nodes
Streaming - Snowball Sampling (large) Large Streaming Graphs with known truth for the Stochastic Block Partitioning Challenge
  20000 nodes 50000 nodes 500000 nodes 2000000 nodes 5000000 nodes

 

2022 Streaming Partition Challenge Datasets with Known Truth Partitions
(These datasets have been used for the streaming partition challenge since 2018)

Each setting includes 8 different graph sizes (1K, 5K, 20K, 50K, 200K, 1M, 5M, 20M nodes)

Name Description
Low Block Overlap and Low Block Size Variation (Full Set): Low level of overlap and low level of size variation between blocks (easiest)
  Static Graphs: 1K 5K 20K 50K 200K 1M 5M 20M
  Streaming Graphs - Edge Sampling: 1K 5K 20K 50K 200K 1M 5M 20M
  Streaming Graphs - Snowball Sampling: 1K 5K 20K 50K 200K 1M 5M 20M
Low Block Overlap and High Block Size Variation (Full Set): Low level of overlap but high level of size variation between blocks
  Static Graphs: 1K 5K 20K 50K 200K 1M 5M 20M
  Streaming Graphs - Edge Sampling: 1K 5K 20K 50K 200K 1M 5M 20M
  Streaming Graphs - Snowball Sampling: 1K 5K 20K 50K 200K 1M 5M 20M
High Block Overlap and Low Block Size Variation (Full Set): High level of overlap but low level of size variation between blocks
  Static Graphs: 1K 5K 20K 50K 200K 1M 5M 20M
  Streaming Graphs - Edge Sampling: 1K 5K 20K 50K 200K 1M 5M 20M
  Streaming Graphs - Snowball Sampling: 1K 5K 20K 50K 200K 1M 5M 20M
High Block Overlap and High Block Size Variation (Full Set): High level of overlap and high level of size variation between blocks (hardest)
  Static Graphs: 1K 5K 20K 50K 200K 1M 5M 20M
  Streaming Graphs - Edge Sampling: 1K 5K 20K 50K 200K 1M 5M 20M
  Streaming Graphs - Snowball Sampling: 1K 5K 20K 50K 200K 1M 5M 20M

 

Graph data available in the Graph Challenge Amazon S3 bucket uses the following formats and conventions:

   <dataset-name>_adj.tsv
   (Row, Col, Value) tuple describing the adjacency matrix of the graph in tab separated format.
   Adjacency matrix is of size Num_vertices x Num_vertices

   <dataset-name>_inc.tsv
   (Row, Col, Value) tuple describing the incidence matrix of the graph in tab separated format.
   Incidence matrix is of size Num_edges x Num_vertices (note that some authors refer to the transpose of this version)

   <dataset-name>_adj.mmio - adjacency matrix of the graph in MMIO format
   <dataset-name>_inc.mmio - incidence matrix of the graph in MMIO format

Details and readers for the MMIO format are available here: http://math.nist.gov/MatrixMarket/
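
The coordinate variant of the Matrix Market format is simple enough to parse by hand: a `%%MatrixMarket` banner line, optional `%` comment lines, a size line (`rows cols nonzeros`), then one entry per line. The Python sketch below handles only that coordinate layout with real, integer, or pattern entries, and it ignores the symmetry field of the banner. It assumes the `.mmio` files use the coordinate layout, which is not stated here, and the name `read_mm_coordinate` is hypothetical. It is an illustration, not a replacement for the readers linked above.

```python
def read_mm_coordinate(lines):
    """Parse Matrix Market coordinate data into (shape, entries).

    entries is a list of (row, col, value) using the file's 1-based indices.
    Pattern matrices (no value column) get a value of 1.0.
    """
    it = iter(lines)
    banner = next(it).split()
    if banner[0] != "%%MatrixMarket" or banner[2].lower() != "coordinate":
        raise ValueError("only Matrix Market coordinate files are supported")
    is_pattern = banner[3].lower() == "pattern"

    # Skip comment and blank lines until the size line is found.
    for line in it:
        if line.strip() and not line.startswith("%"):
            rows, cols, nnz = (int(x) for x in line.split())
            break
    else:
        raise ValueError("missing size line")

    entries = []
    for line in it:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        value = 1.0 if is_pattern else float(parts[2])
        entries.append((int(parts[0]), int(parts[1]), value))
    if len(entries) != nnz:
        raise ValueError("entry count does not match the size line")
    return (rows, cols), entries


# Made-up 3x3 example standing in for a real .mmio file.
sample = """%%MatrixMarket matrix coordinate real general
% made-up 3x3 example
3 3 2
1 2 1.0
2 3 1.0
""".splitlines()
shape, entries = read_mm_coordinate(sample)
print(shape, entries)
```

For a file on disk, an open file object can be passed in place of `sample`, since the function only iterates over lines.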

Indexing note: All matrices use 1-based indexing.
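
A minimal Python sketch of reading one of the TSV files, assuming each line holds a single tab-separated (Row, Col, Value) triple as described above and that the values are numeric. The function name `read_triples` and the optional conversion to 0-based indices are choices made for this illustration; the files themselves are 1-based.

```python
import io


def read_triples(lines, zero_based=True):
    """Parse (row, col, value) triples from an iterable of TSV lines."""
    shift = 1 if zero_based else 0  # files are 1-based; shift to 0-based on request
    triples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row, col, value = line.split("\t")
        triples.append((int(row) - shift, int(col) - shift, float(value)))
    return triples


# A tiny in-memory stand-in for a <dataset-name>_adj.tsv file (made-up data).
sample = io.StringIO("1\t2\t1\n2\t1\t1\n2\t3\t1\n3\t2\t1\n")
triples = read_triples(sample)
print(triples)
```

For a downloaded file, an open file object can be passed instead of the in-memory sample, for example `read_triples(open("amazon0302_adj.tsv"))`.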

Naming conventions for files provided with each SNAP dataset are as follows:

  • Tab Separated Values – edge list, adjacency (_adj) and incidence (_inc) matrices.

        s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX].tsv
        s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX]_adj.tsv
        s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX]_inc.tsv
  • Matrix Market I/O – edge list, adjacency (_adj) and incidence (_inc) matrices.

        s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX].mmio
        s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX]_adj.mmio
        s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX]_inc.mmio

The format of URLs for the synthetic Graph500 data is:

  • Tab Separated Values – edge list, adjacency (_adj) and incidence (_inc) matrices.

        s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16.tsv
        s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16_adj.tsv
        s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16_inc.tsv
  • Matrix Market I/O – edge list, adjacency (_adj) and incidence (_inc) matrices.

        s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16.mmio
        s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16_adj.mmio
        s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16_inc.mmio
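
The naming conventions above can be captured in a pair of small helpers. The Python sketch below fills in the `[URL_SUFFIX]` and `[SCALE]` placeholders to produce S3 URLs; the helper names, the `kind` labels, and the template constants are hypothetical, and the code does not check that a given object actually exists in the bucket.

```python
SNAP_TEMPLATE = "s3://graphchallenge/snap/{name}/{name}{kind}.{ext}"
GRAPH500_TEMPLATE = "s3://graphchallenge/synthetic/{name}/{name}{kind}.{ext}"

# File-name suffix for each matrix kind; the plain edge list has none.
KINDS = {"edges": "", "adjacency": "_adj", "incidence": "_inc"}


def snap_url(url_suffix, kind="adjacency", ext="tsv"):
    """Build the S3 URL for a SNAP dataset file (ext is 'tsv' or 'mmio')."""
    return SNAP_TEMPLATE.format(name=url_suffix, kind=KINDS[kind], ext=ext)


def graph500_url(scale, kind="adjacency", ext="tsv"):
    """Build the S3 URL for a synthetic Graph500 file of the given scale."""
    name = f"graph500-scale{scale}-ef16"
    return GRAPH500_TEMPLATE.format(name=name, kind=KINDS[kind], ext=ext)


print(snap_url("amazon0302"))
print(graph500_url(24, kind="incidence", ext="mmio"))
```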

Files can be individually retrieved using a web browser or command-line tools using the URL scheme for Amazon S3 buckets, for example:

Adjacency Matrix – Tab Separated Values (amazon0302)

https://graphchallenge.s3.amazonaws.com/snap/amazon0302/amazon0302_adj.tsv

Adjacency Matrix – Matrix Market I/O (Synthetic Graph500 network, scale 24)

https://graphchallenge.s3.amazonaws.com/synthetic/graph500-scale24-ef16/...

Using either the AWS CLI tools (awscli) or AWS SDK:

To view all available files in the 'graphchallenge' bucket:

aws s3 ls s3://graphchallenge/

To download a particular dataset from Amazon S3 to local disk:

aws s3 cp s3://graphchallenge/friendster/ ./friendster/ --recursive

Datasets may also be downloaded one file at a time using the HTTPS URL scheme outlined above.
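
For scripted single-file downloads without the AWS CLI, the HTTPS scheme can be used from Python's standard library. The sketch below converts an `s3://` URL to the `https://<bucket>.s3.amazonaws.com/<key>` form shown in the amazon0302 example above and streams the object to disk. The function names are hypothetical, and the code assumes the object is publicly readable over HTTPS.

```python
import shutil
import urllib.request
from urllib.parse import urlparse


def s3_to_https(s3_url):
    """Convert s3://bucket/key to https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {s3_url}")
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


def download(s3_url, dest_path):
    """Stream one object to dest_path without loading it all into memory."""
    with urllib.request.urlopen(s3_to_https(s3_url)) as response, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


print(s3_to_https("s3://graphchallenge/snap/amazon0302/amazon0302_adj.tsv"))
```

Calling `download("s3://graphchallenge/snap/amazon0302/amazon0302_adj.tsv", "amazon0302_adj.tsv")` would then fetch the file; this step needs network access and enough disk space for the chosen dataset.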