Data Sets

Amazon is making the Graph Challenge data sets available to the community free of charge as part of the AWS Public Data Sets program. The data is being presented in several file formats, and there are a variety of ways to access it.

Data is available in the ‘graphchallenge’ Amazon S3 Bucket.  (https://graphchallenge.s3.amazonaws.com)

Anonymized Network Sensing Graph Challenge

Official 2024 Anonymized Network Sensing Challenge

Synthetic pcap files created from GraphBLAS matrices generated with randomized source and destination addresses, using static constants for all other fields in the IP and TCP headers.

Synthetic Sparse Deep Neural Network data for the Sparse DNN Graph Challenge

Official 2019 Sparse Deep Neural Network Challenge

Synthetic DNNs created using RadiX-Net with varying numbers of neurons and layers. Truth categories for MNIST are included for performing inference using DNNs with specific numbers of layers.

1024 Neurons

Sparse Deep Neural Networks with 1024 Neurons per layer (small, 176 MB)

4096 Neurons

Sparse Deep Neural Networks with 4096 neurons per layer (medium, 800 MB)

16384 Neurons

Sparse Deep Neural Networks with 16384 neurons per layer (large, 3.6 GB)

65536 Neurons

Sparse Deep Neural Networks with 65536 neurons per layer (very large, 16.3 GB)

Sparse DNNs generated using interpolated sparse versions of images in the MNIST corpus, resized to produce neural networks of varying dimensions.

MNIST-derived Networks

Real and Synthetic Data for the Static Graph Challenge: Subgraph Isomorphism

Real-world graphs from Stanford’s Large Network Dataset Collection (https://snap.stanford.edu/data/) as well as synthetic data at various scales generated using the scalable Graph500 Kronecker generator (http://www.graph500.org/specifications#sec-3_3) are being provided.

Each of the SNAP datasets is provided in both TSV (Tab-Separated Values) and MMIO (Matrix Market I/O) formats. You can access any file directly by constructing an HTTPS or AWS CLI URL from the URL suffixes listed below, following the instructions later on this page.

A CSV file with metadata about the SNAP datasets below is available here: SNAP Metadata

The metadata includes the number of edges, nodes, and triangles.

SNAP Datasets

amazon0302

Amazon product co-purchasing network from March 2 2003

amazon0312

Amazon product co-purchasing network from March 12 2003

amazon0505

Amazon product co-purchasing network from May 5 2003

amazon0601

Amazon product co-purchasing network from June 1 2003

as20000102

Autonomous Systems graph from January 02 2000

as-caida20071105

CAIDA AS graph from November 5 2007

ca-AstroPh

Collaboration network of Arxiv Astro Physics

ca-CondMat

Collaboration network of Arxiv Condensed Matter

ca-GrQc

Collaboration network of Arxiv General Relativity

ca-HepPh

Collaboration network of Arxiv High Energy Physics

ca-HepTh

Collaboration network of Arxiv High Energy Physics Theory

cit-HepPh

Arxiv High Energy Physics paper citation network

cit-HepTh

Arxiv High Energy Physics Theory paper citation network

cit-Patents

Citation network among US Patents

email-Enron

Email communication network from Enron

email-EuAll

Email network from a EU research institution

facebook_combined

Edges from all Facebook ego networks combined

flickrEdges

Image relationships on Flickr (edges only)

Friendster

Friendster social network graph

loc-brightkite_edges

Brightkite location based online social network

loc-gowalla_edges

Gowalla location based online social network

oregon1_010331

AS peering information inferred from Oregon route-views from March 31 2001

oregon1_010407

AS peering information inferred from Oregon route-views from April 7 2001

oregon1_010414

AS peering information inferred from Oregon route-views from April 14 2001

oregon1_010421

AS peering information inferred from Oregon route-views from April 21 2001

oregon1_010428

AS peering information inferred from Oregon route-views from April 28 2001

oregon1_010505

AS peering information inferred from Oregon route-views from May 05 2001

oregon1_010512

AS peering information inferred from Oregon route-views from May 12 2001

oregon1_010519

AS peering information inferred from Oregon route-views from May 19 2001

oregon1_010526

AS peering information inferred from Oregon route-views from May 26 2001

oregon2_010331

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from March 31 2001

oregon2_010407

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from April 7 2001

oregon2_010414

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from April 14 2001

oregon2_010421

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from April 21 2001

oregon2_010428

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from April 28 2001

oregon2_010505

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from May 05 2001

oregon2_010512

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from May 12 2001

oregon2_010519

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from May 19 2001

oregon2_010526

AS peering information inferred from Oregon route-views, Looking glass data, and Routing registry, from May 26 2001

p2p-Gnutella04

Gnutella peer to peer network from August 4 2002

p2p-Gnutella05

Gnutella peer to peer network from August 5 2002

p2p-Gnutella06

Gnutella peer to peer network from August 6 2002

p2p-Gnutella08

Gnutella peer to peer network from August 8 2002

p2p-Gnutella09

Gnutella peer to peer network from August 9 2002

p2p-Gnutella24

Gnutella peer to peer network from August 24 2002

p2p-Gnutella25

Gnutella peer to peer network from August 25 2002

p2p-Gnutella30

Gnutella peer to peer network from August 30 2002

p2p-Gnutella31

Gnutella peer to peer network from August 31 2002

roadNet-CA

roadNet-PA

roadNet-TX

soc-Epinions1

Who-trusts-whom network of Epinions.com

soc-Slashdot0811

Slashdot social network from November 2008

soc-Slashdot0902

Slashdot social network from February 2009

Source for the synthetic Kronecker graph generator is available here: Kronecker Graphs

A metadata table for the exact Kronecker graphs is available here: Kronecker Graphs (2018 March 16)

Exact Kronecker Graph generator paper: Design, Generation, and Validation of Extreme Scale Power-Law Graphs

Synthetic Kronecker Graphs with many triangles:

Theory-16-25-81-B1k.tsv
Theory-16-25-B1k.tsv
Theory-256-625-B1k.tsv
Theory-25-81-256-B1k.tsv
Theory-25-81-B1k.tsv
Theory-3-4-5-9-16-25-B1k.tsv
Theory-3-4-5-9-16-B1k.tsv
Theory-3-4-5-9-B1k.tsv
Theory-3-4-5-B1k.tsv
Theory-3-4-B1k.tsv
Theory-4-5-9-16-25-B1k.tsv
Theory-4-5-9-16-B1k.tsv
Theory-4-5-9-B1k.tsv
Theory-4-5-B1k.tsv
Theory-5-9-16-25-81-B1k.tsv
Theory-5-9-16-25-B1k.tsv
Theory-5-9-16-B1k.tsv
Theory-5-9-B1k.tsv
Theory-81-256-B1k.tsv
Theory-9-16-25-81-B1k.tsv
Theory-9-16-25-B1k.tsv
Theory-9-16-B1k.tsv

Synthetic Kronecker Graphs with some triangles:

Theory-16-25-81-B2k.tsv
Theory-16-25-B2k.tsv
Theory-256-625-B2k.tsv
Theory-25-81-256-B2k.tsv
Theory-25-81-B2k.tsv
Theory-3-4-5-9-16-25-B2k.tsv
Theory-3-4-5-9-16-B2k.tsv
Theory-3-4-5-9-B2k.tsv
Theory-3-4-5-B2k.tsv
Theory-3-4-B2k.tsv
Theory-4-5-9-16-25-B2k.tsv
Theory-4-5-9-16-B2k.tsv
Theory-4-5-9-B2k.tsv
Theory-4-5-B2k.tsv
Theory-5-9-16-25-81-B2k.tsv
Theory-5-9-16-25-B2k.tsv
Theory-5-9-16-B2k.tsv
Theory-5-9-B2k.tsv
Theory-81-256-B2k.tsv
Theory-9-16-25-81-B2k.tsv
Theory-9-16-25-B2k.tsv
Theory-9-16-B2k.tsv

Protein k-mer graphs generated using data from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) are available below. Nodes of the graph represent segments of amino acids.

Protein k-mer graphs

Graph 1

Num. vertices : 170728175 Edge count : 360585172

Adjacency TSV

Graph 2

Num. vertices : 139353211 Edge count : 297829984

Adjacency TSV

Graph 3

Num. vertices : 67716231 Edge count : 138778562

Adjacency TSV

Graph 4

Num. vertices : 214005017 Edge count : 465410904

Adjacency TSV

Graph 5

Num. vertices : 55042369 Edge count : 117217600

Adjacency TSV

MAWI Working Group Traffic Archive (http://mawi.wide.ad.jp/mawi/): The MAWI (Measurement and Analysis on the WIDE Internet) Working Group is a working group that has carried out network traffic measurement, analysis, evaluation, and verification from the beginning of the WIDE Project. The graphs provided here were generated from packet trace data from the WIDE backbone maintained by the MAWI Working Group.

MAWI Datasets

Graph 1

Num. vertices : 18571154, Edge count : 38040320

Adjacency TSV

Graph 2

Num. vertices : 35991342, Edge count : 74485420

Adjacency TSV

Graph 3

Num. vertices : 68863315, Edge count : 143414960

Adjacency TSV

Graph 4

Num. vertices : 128568730, Edge count : 270234840

Adjacency TSV

Graph 5

Num. vertices : 226196185, Edge count : 480047894

Adjacency TSV

Synthetic Datasets

graph500-scale18-ef16

Synthetic graph500 network of scale 18 (262144×262144, 4194304 edges)

graph500-scale19-ef16

Synthetic graph500 network of scale 19 (524288×524288, 8388608 edges)

graph500-scale20-ef16

Synthetic graph500 network of scale 20 (1048576×1048576, 16777216 edges)

graph500-scale21-ef16

Synthetic graph500 network of scale 21 (2097152×2097152, 33554432 edges)

graph500-scale22-ef16

Synthetic graph500 network of scale 22 (4194304×4194304, 67108864 edges)

graph500-scale23-ef16

Synthetic graph500 network of scale 23 (8388608×8388608, 134217728 edges)

graph500-scale24-ef16

Synthetic graph500 network of scale 24 (16777216×16777216, 268435456 edges)

graph500-scale25-ef16

Synthetic graph500 network of scale 25 (33554432×33554432, 536870912 edges)
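
The sizes listed for the synthetic graphs follow a simple pattern: a graph of scale s has 2^s vertices, and the edge counts equal 16 times the vertex count. The sketch below reproduces that arithmetic. Reading "ef16" as an edge factor of 16 is an inference from the listed numbers, and the function name is hypothetical.

```python
def graph500_size(scale, edge_factor=16):
    """Return (num_vertices, num_edges) for a Graph500-style graph.

    Assumes num_vertices = 2**scale and num_edges = edge_factor * num_vertices,
    which matches the figures listed for scales 18 through 25.
    """
    num_vertices = 2 ** scale
    num_edges = edge_factor * num_vertices
    return num_vertices, num_edges


for scale in range(18, 26):
    n, m = graph500_size(scale)
    print(f"graph500-scale{scale}-ef16: {n}x{n}, {m} edges")
```

This can serve as a quick sanity check on a downloaded file before loading it in full.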

Synthetic Data for the Streaming Graph Challenge: Stochastic Block Partition

Provided below is a set of synthetic datasets generated as MxM images, where M = 2^n for n = 8, 9, 10, 11, 12, 13. Each pixel in the image is treated as a node in the graph and is connected to its 8 neighbors by undirected edges. Pixels on the boundary have fewer neighbors: 3 at the corners and 5 along the edges.
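
As a rough illustration of the pixel-graph construction, the sketch below enumerates the undirected edges of an m x m grid under standard 8-connectivity without wraparound. The row-major, 1-based node numbering and the function name are assumptions made for the example and may not match the numbering in the published files. Under this construction, corner pixels end up with 3 neighbors, other boundary pixels with 5, and interior pixels with 8.

```python
from collections import Counter


def grid_edges(m):
    """Yield each undirected edge (u, v), u < v, of an m x m 8-neighbor grid.

    Node IDs are 1-based and row-major (a hypothetical numbering).
    """
    def node(r, c):
        return r * m + c + 1

    # Looking only right, down-left, down, and down-right visits each
    # undirected edge exactly once.
    for r in range(m):
        for c in range(m):
            for dr, dc in ((0, 1), (1, -1), (1, 0), (1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < m and 0 <= cc < m:
                    yield node(r, c), node(rr, cc)


m = 8  # small example; the datasets use m = 2**n for n = 8..13
degree = Counter()
num_edges = 0
for u, v in grid_edges(m):
    degree[u] += 1
    degree[v] += 1
    num_edges += 1

print(num_edges)                 # equals 2 * (m - 1) * (2 * m - 1)
print(Counter(degree.values()))  # number of nodes at each degree
```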

Provided below is a set of synthetic datasets with known truth partitions for use in the Stochastic Block Partitioning Graph Challenge.

2017 Streaming Partition Challenge Datasets with Known Truth Partitions

Static Graphs (small)

Small Static Graphs with known truth for the Stochastic Block Partitioning Challenge

Static Graphs (large)

Large Static Graphs with known truth for the Stochastic Block Partitioning Challenge

Streaming – Edge Sampling (small)

Small Streaming Graphs with known truth for the Stochastic Block Partitioning Challenge

Streaming – Edge Sampling (large)

Large Streaming Graphs with known truth for the Stochastic Block Partitioning Challenge

Streaming – Snowball Sampling (small)

Small Streaming Graphs with known truth for the Stochastic Block Partitioning Challenge

Streaming – Snowball Sampling (large)

Large Streaming Graphs with known truth for the Stochastic Block Partitioning Challenge

2022 Streaming Partition Challenge Datasets with Known Truth Partitions
(These datasets have been used for the streaming partition challenge since 2018)

Each setting includes 8 different graph sizes (1K, 5K, 20K, 50K, 200K, 1M, 5M, 20M nodes)

Low Block Overlap and Low Block Size Variation     Full Set

Low level of overlap and low level of size variation between blocks (easiest)

Static Graphs

Streaming Graphs – Edge Sampling

Streaming Graphs – Snowball Sampling

Low Block Overlap and High Block Size Variation    Full Set

Low level of overlap but high level of size variation between blocks

Static Graphs

Streaming Graphs – Edge Sampling

Streaming Graphs – Snowball Sampling

High Block Overlap and Low Block Size Variation    Full Set

High level of overlap but low level of size variation between blocks

Static Graphs

Streaming Graphs – Edge Sampling

Streaming Graphs – Snowball Sampling

High Block Overlap and High Block Size Variation    Full Set

High level of overlap and high level of size variation between blocks (hardest)

Static Graphs

Streaming Graphs – Edge Sampling

Streaming Graphs – Snowball Sampling

Graph data available in the Graph Challenge Amazon S3 bucket uses the following formats and conventions:

<dataset-name>_adj.tsv
(Row, Col, Value) tuples describing the adjacency matrix of the graph in tab-separated format.
The adjacency matrix is of size Num_vertices x Num_vertices.

<dataset-name>_inc.tsv
(Row, Col, Value) tuples describing the incidence matrix of the graph in tab-separated format.
The incidence matrix is of size Num_edges x Num_vertices (note that some authors refer to the transpose of this version).

<dataset-name>_adj.mmio - adjacency matrix of the graph in MMIO format
<dataset-name>_inc.mmio - incidence matrix of the graph in MMIO format

Details and readers for the MMIO format are available here: http://math.nist.gov/MatrixMarket/

Indexing note: All matrices use 1-based indexing
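
A minimal sketch of reading an adjacency TSV file is shown below. It assumes each non-empty line holds exactly three tab-separated fields (row, column, value) with no header line, and it shifts the 1-based indices to 0-based for use with Python containers. The function name and the sample data are made up for illustration.

```python
import io
from collections import defaultdict


def read_adjacency_tsv(handle):
    """Parse (row, col, value) lines into {row: {col: value}} with 0-based IDs."""
    adjacency = defaultdict(dict)
    for line in handle:
        line = line.strip()
        if not line:
            continue
        row, col, value = line.split("\t")
        # Files use 1-based indices; shift to 0-based for Python containers.
        adjacency[int(row) - 1][int(col) - 1] = float(value)
    return adjacency


# Tiny made-up sample standing in for a <dataset-name>_adj.tsv file.
sample = "1\t2\t1\n2\t1\t1\n2\t3\t1\n3\t2\t1\n"
adj = read_adjacency_tsv(io.StringIO(sample))
num_nonzeros = sum(len(cols) for cols in adj.values())
print(dict(adj), num_nonzeros)
```

For a real file, pass an open file object instead of the `io.StringIO` wrapper. For the larger graphs a sparse-matrix library is likely a better fit than nested dictionaries.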

Naming conventions for files provided with each SNAP dataset are as follows:

  • Tab Separated Values – edge list, adjacency (_adj) and incidence (_inc) matrices.
s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX].tsv
s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX]_adj.tsv
s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX]_inc.tsv
  • Matrix Market I/O – edge list, adjacency (_adj) and incidence (_inc) matrices.
s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX].mmio
s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX]_adj.mmio
s3://graphchallenge/snap/[URL_SUFFIX]/[URL_SUFFIX]_inc.mmio
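
The SNAP naming convention can be expressed as a small helper, sketched here. It assumes the URL suffix is the dataset name exactly as listed on this page (as it is for amazon0302) and does not check that every file exists in the bucket. The function name is hypothetical.

```python
SNAP_PREFIX = "s3://graphchallenge/snap"


def snap_uris(url_suffix):
    """Return the six S3 URIs implied by the naming convention for one dataset."""
    base = f"{SNAP_PREFIX}/{url_suffix}/{url_suffix}"
    return [
        f"{base}{variant}.{ext}"
        for ext in ("tsv", "mmio")          # file format
        for variant in ("", "_adj", "_inc")  # edge list, adjacency, incidence
    ]


for uri in snap_uris("amazon0302"):
    print(uri)
```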

The format of URLs for the synthetic Graph500 data is:

  • Tab Separated Values – edge list, adjacency (_adj) and incidence (_inc) matrices.
s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16.tsv
s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16_adj.tsv
s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16_inc.tsv
  • Matrix Market I/O – edge list, adjacency (_adj) and incidence (_inc) matrices.
s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16.mmio
s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16_adj.mmio
s3://graphchallenge/synthetic/graph500-scale[SCALE]-ef16/graph500-scale[SCALE]-ef16_inc.mmio
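
For the synthetic data, the same idea applies with [SCALE] substituted into the path. The sketch below restricts the scale to the values listed on this page (18 through 25). The function name and its parameters are hypothetical.

```python
def graph500_uri(scale, matrix="", fmt="tsv"):
    """Build one S3 URI for a synthetic Graph500 file.

    matrix: "" (edge list), "adj", or "inc"; fmt: "tsv" or "mmio".
    """
    if scale not in range(18, 26):
        raise ValueError("only scales 18 through 25 are listed on this page")
    if matrix not in ("", "adj", "inc") or fmt not in ("tsv", "mmio"):
        raise ValueError("unknown matrix kind or file format")
    name = f"graph500-scale{scale}-ef16"
    suffix = f"_{matrix}" if matrix else ""
    return f"s3://graphchallenge/synthetic/{name}/{name}{suffix}.{fmt}"


print(graph500_uri(18, "adj"))
```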

Files can be retrieved individually with a web browser or command-line tools via the HTTPS URL scheme for Amazon S3 buckets, for example:

Adjacency Matrix – Tab Separated Values (amazon0302)

https://graphchallenge.s3.amazonaws.com/snap/amazon0302/amazon0302_adj.tsv

Adjacency Matrix – Matrix Market I/O (Synthetic Graph500 network, scale 24)

https://graphchallenge.s3.amazonaws.com/synthetic/graph500-scale24-ef16/…
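
The mapping from an s3:// URI to its HTTPS form, as seen in the amazon0302 example, can be sketched as follows. The helper names are hypothetical, and the download function is defined but not called because it requires network access.

```python
import urllib.request
from urllib.parse import urlsplit


def s3_to_https(s3_uri):
    """Map s3://bucket/key to https://bucket.s3.amazonaws.com/key."""
    parts = urlsplit(s3_uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return f"https://{parts.netloc}.s3.amazonaws.com{parts.path}"


def download(s3_uri, destination):
    """Fetch one file over HTTPS (needs network access; not called here)."""
    urllib.request.urlretrieve(s3_to_https(s3_uri), destination)


print(s3_to_https("s3://graphchallenge/snap/amazon0302/amazon0302_adj.tsv"))
```

A call such as `download("s3://graphchallenge/snap/amazon0302/amazon0302_adj.tsv", "amazon0302_adj.tsv")` would then save the file locally.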

Using either the AWS CLI tools (awscli) or AWS SDK:

To view all available files in the ‘graphchallenge’ bucket:

aws s3 ls s3://graphchallenge/

To download a particular dataset from Amazon S3 to local disk:

aws s3 cp s3://graphchallenge/friendster/ ./friendster/ --recursive
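
When scripting downloads, the same command can be assembled programmatically. The sketch below only builds the argument list and leaves running it to the caller. The function name is hypothetical, and the optional `--no-sign-request` flag (an AWS CLI global option that skips credential signing) is included on the assumption that the bucket accepts unsigned public reads.

```python
import shlex


def s3_cp_command(dataset_prefix, local_dir, anonymous=True):
    """Build the argv list for a recursive `aws s3 cp` of one dataset."""
    argv = [
        "aws", "s3", "cp",
        f"s3://graphchallenge/{dataset_prefix.strip('/')}/",
        local_dir,
        "--recursive",
    ]
    if anonymous:
        # Assumption: the bucket allows unsigned requests for public data.
        argv.append("--no-sign-request")
    return argv


cmd = s3_cp_command("friendster", "./friendster/")
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```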

Datasets may also be downloaded one file at a time using the HTTPS URL scheme outlined above.