KDD Cup 2020
Rank Team Code
1 aister https://github.com/aister2020/KDDCUP_2020_AutoGraph_1st_Place
2 PASA_NJU https://github.com/Unkrible/AutoGraph2020
3 qqerret https://github.com/white-bird/kdd2020_GCN
4 common https://github.com/joneswong/AutoGraph
5 PostDawn https://github.com/hxttkl/KDDCUP2020_AutoEnsemble
6 (tie) SmartMN-THU https://github.com/AutoGraphMaNlab/AutoGraph
6 (tie) JunweiSun https://github.com/JunweiSUN/AutoGL
8 u1234x1234 https://github.com/u1234x1234/KDD-Cup-2020-AutoGraph-Challenge
9 shiqitao https://github.com/shiqitao/AutoGraphM
10 (tie) supergx https://github.com/AnryYang/feat-propagation
10 (tie) daydayup https://github.com/daydayupPro/KDDCup2020_AutoGraph

Overview

KDD 2020 will be held in San Diego, CA, USA from August 23 to 27, 2020. The Automatic Graph Representation Learning challenge (AutoGraph), the first ever AutoML challenge applied to graph-structured data, is the AutoML track challenge in KDD Cup 2020, provided by 4Paradigm, ChaLearn, Stanford and Google. The challenge website can be found here: https://www.automl.ai/competitions/3

Machine learning on graph-structured data. Graph-structured data are ubiquitous in the real world, appearing in social networks, scholar networks, knowledge graphs, etc. Graph representation learning has been a very hot topic; its goal is to learn a low-dimensional representation of each node in the graph, which is then used for downstream tasks such as friend recommendation in a social network, or classifying academic papers into different subjects in a citation network. Traditionally, heuristics were exploited to extract features for each node from the graph, e.g., degree statistics or random-walk-based similarities. In recent years, however, sophisticated models such as graph neural networks (GNNs) have been proposed for graph representation learning tasks, leading to state-of-the-art results in many tasks such as node classification and link prediction.

Challenges in developing versatile models. Nevertheless, whether using traditional heuristic methods or recent GNN-based methods, huge computational and expertise resources must be invested to achieve satisfying performance on a given task. For example, in DeepWalk and node2vec, two well-known random-walk-based methods, various hyper-parameters, such as the length and number of walks per node and the window size, have to be fine-tuned to obtain better performance. And when using GNN models, e.g. GraphSAGE or GAT, one has to spend considerable time choosing the optimal aggregation function in GraphSAGE, or the number of self-attention heads in GAT. This heavy demand for human experts in the fine-tuning process limits the application of existing graph representation models.

AutoGraph Challenge. AutoML/AutoDL (https://autodl.chalearn.org) is a promising approach to lower the manpower cost of machine learning applications, and has achieved encouraging successes in hyper-parameter tuning, model selection, neural architecture search, and feature engineering. In order to enable more people and organizations to fully exploit their graph-structured data, we organize the AutoGraph challenge dedicated to such data.

In this challenge, participants should design a computer program capable of solving graph representation learning problems autonomously (without any human intervention). Compared to previous AutoML competitions we organized, our new focus is on graph-structured data, where nodes with features and edges (connections among nodes) are available.
To prevail in the proposed challenge, participants should propose automatic solutions that can effectively and efficiently learn high-quality representations for each node based on the given features, neighborhood, and structural information underlying the graph. The solutions should be designed to automatically extract and utilize any useful signals in the graph, whether by heuristic or systematic models. Here we list some specific questions that the participants should consider and answer (a sketch of one possible answer follows the list):
  • How to automatically design heuristics to extract features for a node in a graph?
  • How to automatically exploit the neighborhood information in a graph?
  • How to automatically tune an optimal set of hyper-parameters for random-walk-based graph embedding methods?
  • How to automatically choose the aggregation function when using GNN-based models?
  • How to automatically design an optimal GNN architecture for different datasets?
  • How to automatically and efficiently select appropriate hyper-parameters for different models?
  • How to make the solution more generic, i.e., applicable to unseen tasks?
  • How to keep the computational and memory cost acceptable?
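
As one concrete (and deliberately simple) answer to the hyper-parameter questions above, here is a minimal sketch of budget-aware random search. The search space and the train_and_score callable are hypothetical placeholders, not part of the challenge API.

import random
import time

# Hypothetical search space for a GNN; the names and values are illustrative.
SEARCH_SPACE = {
    "hidden_dim": [16, 32, 64, 128],
    "num_layers": [1, 2, 3],
    "dropout": [0.0, 0.3, 0.5],
    "learning_rate": [1e-3, 5e-3, 1e-2],
}

def random_search(train_and_score, time_budget, safety_margin=0.1):
    # Sample random configurations until the budget is nearly spent.
    # train_and_score is a caller-supplied function mapping a config
    # dict to a validation score (higher is better).
    deadline = time.time() + time_budget * (1.0 - safety_margin)
    best_config, best_score = None, float("-inf")
    while time.time() < deadline:
        config = {name: random.choice(values) for name, values in SEARCH_SPACE.items()}
        score = train_and_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

Reserving a safety margin before the deadline is one simple way to leave time for retraining the best configuration and producing final predictions.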

Tentative Timeline

  • March 25th, 2020: Beginning of Feedback Phase, release of public datasets. Participants can start submitting codes and obtaining immediate feedback in the leaderboard.
  • May 25th, 2020: End of Feedback Phase
  • May 26th, 2020: Beginning of Check Phase
  • June 1st, 2020: End of Check Phase; organizers notify participants of Check Phase results
  • June 2nd, 2020: Beginning of Final Phase
  • June 4th, 2020: Deadline for re-submitting to Final Phase
  • June 5th, 2020: Deadline for submitting the fact sheets
  • June 7th, 2020: End of Final Phase, beginning of post competition process
  • June 9th, 2020: Announcement of the KDD Cup 2020 Winners
  • August 22nd, 2020: Beginning of KDD 2020

Dataset

This page describes the datasets used in the AutoGraph challenge. 15 graph datasets are prepared for this competition. 5 public datasets, which can be downloaded, are provided to the participants so that they can develop their solutions offline. Besides that, another 5 feedback datasets are provided to participants to evaluate the public leaderboard scores of their AutoGraph solutions. Afterwards, their solutions will be evaluated on 5 final datasets without human intervention.

This challenge focuses on the problem of graph representation learning, where node classification is chosen as the task to evaluate the quality of learned representations.

Note that you can try more datasets to debug your solutions in the Open Graph Benchmark and the SNAP project from Stanford University.

Components

The datasets are collected from real-world business, and are shuffled and split into training and testing parts. Each dataset contains two node files (training and testing), an edge file, a feature file, two label files (training and testing) and a metadata file.
Please note that the data files are read by our program and sent to the participant's program. For details, please see Evaluations. (A sketch of loading these files offline follows the component list below.)

  • The training node file (train_node_id.txt) and testing node file (test_node_id.txt) list all node indices used for training and testing, respectively. The node indices are of int type.

    Example:

    node_index
    0
    1
    2
    3
    4
    5
    6
    7
    8

  • The edge file (edge.tsv) contains a set of triplets. A triplet of the form (src_idx, dst_idx, edge_weight) describes a connection from node index src_idx to node index dst_idx with edge weight edge_weight. The type of edge_weight is numerical (float or int).

    Example:

    src_idx dst_idx edge_weight
    0 62 1
    0 40 1
    0 127 1
    0 178 1
    0 53 1
    0 67 1
    0 189 1
    0 135 1
    0 48 1

  • The feature file (feature.tsv) is in tsv format. A line of the file is in the format (node_index f0 f1 ...), where node_index is the index of a node and f0, f1, ... are its features.

    The features are all numerical.

    Example:

    node_index f0 f1 f2 f3 f4
    0 0.47775876104073356 0.05387578793865644 0.729954200019264 0.6908184238803438 0.9235037015600726
    1 0.34224099072954905 0.6693042243297719 0.08736572053032532 0.07358721227831977 0.27398819586899037
    2 0.8259856025619777 0.4421366756096389 0.9872258141866499 0.4865590790508849 0.12633483872234397
    3 0.11177231902956064 0.40446709473609854 0.2293892960354328 0.4021930454713125 0.40698138834963693
    4 0.34427740190016 0.26622372452918375 0.8042497280547812 0.0022605424347530434 0.8903425653304337
    5 0.08640169107378592 0.43038539444039425 0.6635778390235518 0.9229371884297638 0.8912709075205572
    6 0.6765202023072282 0.9039673560303431 0.986304900152288 0.23661480664770496 0.7140162062880935
    7 0.043651531427249424 0.010090830922163785 0.758404203984433 0.05315076246728134 0.8017402643849966
    8 0.49802375200717 0.6735698429117265 0.04292694482433346 0.3033723691640159 0.43132281219124635

  • The training label file (train_label.tsv) and the testing label file (test_label.tsv) are also in tsv format and contain label information of the training and testing nodes, respectively. A line in the files is in the format (node_index class), where node_index is the index of a node and class is its label.

    Example:

    node_index class
    0 1
    1 3
    2 1
    3 1
    4 3
    5 1
    6 1
    7 3
    8 1

  • The metadata file (config.yml) is in yaml format. It provides meta-information about the dataset, including:

    • schema: DEPRECATED
    • n_class: the number of label classes in the dataset
    • time_budget: the time budget of the dataset

    Example:

    time_budget: 5000
    n_class: 7
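
For offline development, here is a minimal sketch of reading these files with pandas and PyYAML. The directory path and the exact parsing options (e.g., separators, headers) are assumptions based on the formats described above, not the platform's actual loader.

import pandas as pd
import yaml

dataset_dir = "public_dataset_1"  # hypothetical path to one public dataset

# single-column files with a "node_index" header, as in the example above
train_ids = pd.read_csv(f"{dataset_dir}/train_node_id.txt")["node_index"]
test_ids = pd.read_csv(f"{dataset_dir}/test_node_id.txt")["node_index"]

# tab-separated files, as described above
edges = pd.read_csv(f"{dataset_dir}/edge.tsv", sep="\t")
features = pd.read_csv(f"{dataset_dir}/feature.tsv", sep="\t")
train_labels = pd.read_csv(f"{dataset_dir}/train_label.tsv", sep="\t")

# metadata in yaml format
with open(f"{dataset_dir}/config.yml") as f:
    meta = yaml.safe_load(f)
time_budget, n_class = meta["time_budget"], meta["n_class"]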


Rules

This challenge has three phases. The participants are provided with 5 public datasets which can be downloaded, so that they can develop their solutions offline. Then, the code is uploaded to the platform and participants receive immediate feedback on the performance of their method on another 5 feedback datasets. After the Feedback Phase terminates, we will have a Check Phase, where participants are allowed to submit their code only once on the final datasets in order to debug. Participants won't be able to read detailed logs, but they are able to see whether their code reports errors. Last, in the Final Phase, participants' solutions will be evaluated on 5 final datasets. The ranking in the Final Phase will count towards determining the winners.

Code submitted is trained and tested automatically, without any human intervention. Code submitted in the Feedback (resp. Final) Phase is run on all 5 feedback (resp. final) datasets in parallel on separate compute workers, each one with its own time budget.

The identities of the datasets used for testing on the platform are concealed. The data are provided in a raw form (no feature extraction) to encourage researchers to use deep learning methods that perform automatic feature learning, although this is NOT a requirement. All problems are node classification problems. The tasks are constrained by the time budget.

Here is some pseudo-code of the evaluation protocol:

# For each dataset, our evaluation program calls the model constructor:

# load the dataset
dataset = Dataset(args.dataset_dir)

# get information about the dataset
time_budget = dataset.get_metadata().get("time_budget")
n_class = dataset.get_metadata().get("n_class")
schema = dataset.get_metadata().get("schema")

# import and initialize the participant's Model class
umodel = init_usermodel()

# initialize the timer
timer = _init_timer(time_budget)

# train the model and predict the labels of the testing data
predictions = _train_predict(umodel, dataset, timer, n_class, schema)
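
Note that _init_timer and _train_predict are internals of the evaluation program and are not released here. Purely as an assumption about how the budget could be tracked, a deadline-based timer might look like the following sketch; the platform's real implementation may differ.

import time

class Timer:
    """Deadline-based timer sketch; not the platform's actual class."""

    def __init__(self, time_budget):
        # time_budget is taken from config.yml
        self.deadline = time.time() + time_budget

    def remaining(self):
        # time left before the budget is exhausted
        return self.deadline - time.time()

    def check(self):
        # raise if the submission has run out of time
        if self.remaining() <= 0:
            raise TimeoutError("time budget exhausted")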

Metrics

For both the Feedback Phase and the Final Phase, accuracy is evaluated on each dataset. Submissions are ranked by their average rank over all datasets of a phase.

Note that if a submission fails on a certain dataset, a default score (-1 in this challenge) will be recorded for the corresponding dataset on the leaderboard.
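
To make the ranking rule concrete, here is a small worked example with made-up per-dataset ranks for three hypothetical submissions on five datasets (rank 1 is best; a failed run's -1 score would place it last on that dataset):

import numpy as np

ranks = {
    "team_a": [1, 2, 1, 3, 2],
    "team_b": [2, 1, 3, 1, 1],
    "team_c": [3, 3, 2, 2, 3],
}
# average rank per team; lower is better
avg_rank = {team: float(np.mean(r)) for team, r in ranks.items()}
leaderboard = sorted(avg_rank.items(), key=lambda kv: kv[1])
print(leaderboard)  # [('team_b', 1.6), ('team_a', 1.8), ('team_c', 2.6)]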


API

The participants should implement a class Model with a method train_predict, described as follows:

class Model:
    """user model"""

    def __init__(self):
        # initialization
        pass

    def train_predict(self, data, time_budget, n_class, schema):
        """train and predict

        This method will be called by the competition platform and is
        constrained by time_budget.

        Parameters
        ----------
        data: dict, stores all input data. Keys and values are:
            'fea_table': pandas.DataFrame, features of the training and testing nodes
            'edge_file': pandas.DataFrame, edge information of the graph;
                dtypes of all columns are int
            'train_indices': list of int, indices of all training nodes
            'test_indices': list of int, indices of all testing nodes
            'train_label': pandas.DataFrame, labels of the training nodes
            For the details, please check the format of the data files.
        time_budget: the time budget of this task
        n_class: int, the number of classes in this task
        schema: deprecated

        Returns
        -------
        pred: list (or pandas.Series / 1D numpy.ndarray)
            pred contains predictions for all testing samples, in the same
            order as test_indices.
        """
        return pred

It is the responsibility of the participants to make sure that the "train_predict" method does not exceed the time budget.
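
For concreteness, here is a deliberately trivial implementation of this interface: a majority-class baseline that ignores the graph entirely. Only the train_predict signature is mandated by the challenge; the prediction logic is our own illustration, and the 'class' column name is assumed from the train_label.tsv example above.

class Model:
    """A trivial baseline: predict the most frequent training label for every test node."""

    def __init__(self):
        pass

    def train_predict(self, data, time_budget, n_class, schema):
        train_label = data["train_label"]  # DataFrame with columns (node_index, class)
        majority = train_label["class"].mode().iloc[0]  # most frequent class
        # one prediction per test node, in the order of test_indices
        return [majority] * len(data["test_indices"])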


Prizes

1st Prize: 15000 USD

2nd Prize: 10000 USD

3rd Prize: 5000 USD

4th - 10th prize: 500 USD each

About

Please contact the organizers if you have any problem concerning this challenge.


Advisors

  • Wei-Wei Tu, 4Paradigm Inc., China and ChaLearn, USA
  • Jure Leskovec, Stanford University, USA
  • Hugo Jair Escalante, INAOE, Mexico and ChaLearn, USA
  • Isabelle Guyon, Université Paris-Saclay, France and ChaLearn, USA
  • Qiang Yang, Hong Kong University of Science and Technology, Hong Kong, China

Committee (alphabetical order)

  • Xiawei Guo, 4Paradigm Inc., China
  • Shouxiang Liu, 4Paradigm Inc., China
  • Zhen Xu, 4Paradigm Inc., China
  • Rex Ying, Stanford University, USA
  • Huan Zhao, 4Paradigm Inc., China

Organizing Institutes

4Paradigm, ChaLearn, Stanford University, and Google





About 4Paradigm Inc.

Founded in early 2015, 4Paradigm is one of the world's leading AI technology and service providers for industrial applications. 4Paradigm's flagship product, the AI Prophet, is an AI development platform that enables enterprises to effortlessly build their own AI applications and thereby significantly increase their operational efficiency. Using the AI Prophet, a company can develop a data-driven "AI Core System", which can be regarded as a second core system next to the traditional transaction-oriented Core Banking System (IBM Mainframe) often found in banks. Beyond this, 4Paradigm has also successfully developed more than 100 AI solutions for use in various settings such as finance, telecommunication and internet applications. These solutions include, but are not limited to, smart pricing, real-time anti-fraud systems, precision marketing, personalized recommendation and more. And while 4Paradigm can completely reshape the paradigm by which an organization uses its data, its scope of services does not stop there. 4Paradigm combines state-of-the-art machine learning technologies with practical experience, bringing together a team of experts ranging from scientists to architects. This team has successfully built China's largest machine learning system and the world's first commercial deep learning system. However, 4Paradigm's success does not stop there. With its core team pioneering the research of "Transfer Learning," 4Paradigm takes the lead in this area and, as a result, has drawn great attention from worldwide tech giants.

About ChaLearn

ChaLearn is a non-profit organization with vast experience in the organization of academic challenges. ChaLearn is interested in all aspects of challenge organization, including data gathering procedures, evaluation protocols, novel challenge scenarios (e.g., competitions), training for challenge organizers, challenge analytics, result dissemination and, ultimately, advancing the state-of-the-art through challenges.