Old info

Plan

Data collection
Tags and code markup
Creating classification models
Building a knowledge graph of the Deep Learning process
Proof of concept of a generative code model driven by an NL description

Current Tasks (until 1.Sep)

The current short-term goal is to build a model that can classify a source code chunk and locate exactly where the detected class appears within the chunk (tag segmentation).

Pipeline proof of concept: code2vec embeddings & ¿PCA? (clustering). Prerequisite: parse only the code from Kaggle, i.e. remove the markdown constraint
Build a CNN model that searches for graph vertices (tags) in the code
Manually search the code parsed from Kaggle for specific data types
Validate how well simple models (e.g. LogReg, SVM) handle the tag classification task on a combination of NL2ML and other corpora, e.g. CoNaLa/StackQC/etc.
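
The simple-model check in the last task could be sketched roughly as below, assuming scikit-learn is available. The code chunks, the tag names (data_load, train, visualize) and the character n-gram features are made up for illustration; they are not the project's real data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up (code chunk, tag) pairs; real data would come from NL2ML and the other corpora.
chunks = [
    "import pandas as pd\ndf = pd.read_csv('train.csv')",
    "data = pd.read_csv('test.csv')",
    "model.fit(X_train, y_train)",
    "clf.fit(X, y)",
    "plt.plot(history)\nplt.show()",
    "sns.heatmap(df.corr())\nplt.show()",
]
tags = ["data_load", "data_load", "train", "train", "visualize", "visualize"]


def make_baseline(classifier):
    """TF-IDF over character n-grams (an assumed feature choice) followed by a linear classifier."""
    return make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        classifier,
    )


baselines = {
    "logreg": make_baseline(LogisticRegression(max_iter=1000)),
    "svm": make_baseline(LinearSVC()),
}

# Fit each baseline on the toy data, then predict the tag of one unseen chunk.
new_chunk = "df = pd.read_csv('sample.csv')"
predictions = {}
for name, model in baselines.items():
    model.fit(chunks, tags)
    predictions[name] = model.predict([new_chunk])[0]

print(predictions)
```

On real corpora the same pipelines would be scored on a held-out split (for example with sklearn.model_selection.cross_val_score) rather than on a single chunk.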

Datasets description

(a backup is stored at https://yadi.sk/d/qY9lEd6-275KEw)

Basically, you can download all the data and the models with dvc pull:

Clone the repo into a folder
Install DVC
Open a terminal and go to the repo folder
Run dvc pull, or dvc pull data (if you want only the data without the models)
Enjoy!
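
For scripting the steps above, a minimal Python sketch could look like the following. It assumes git and dvc are already installed and on the PATH. The function names build_commands and fetch are hypothetical helpers, and the repository URL is not given on this page, so it has to be supplied by the caller.

```python
import subprocess


def build_commands(repo_url, dest, data_only=False):
    """Return the commands for the steps above as (argv, working_dir) pairs."""
    pull = ["dvc", "pull"] + (["data"] if data_only else [])
    return [
        (["git", "clone", repo_url, str(dest)], None),  # clone the repo into dest
        (pull, str(dest)),                              # run dvc pull inside the repo
    ]


def fetch(repo_url, dest, data_only=False):
    """Run the commands in order; check=True stops at the first failing command."""
    for argv, cwd in build_commands(repo_url, dest, data_only):
        subprocess.run(argv, cwd=cwd, check=True)


# Usage (the URL and folder name are placeholders, not the real repository):
# fetch("<repo-url>", "nl2ml", data_only=True)
```

Depending on where the DVC remote is hosted, dvc may need extra dependencies or credentials before the pull succeeds.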