Plan
- Data collection
- Tagging and code markup
- Building classification models
- Building a knowledge graph of the deep learning process
- Proof of concept of a model that generates code from a natural-language (NL) description
Current Tasks (until 1 Sep)
The current short-term goal is to build a model that can classify a chunk of source code and locate exactly where the detected class occurs within the chunk (tag segmentation).
- Pipeline proof of concept: code2vec embeddings and PCA (?) for clustering. Prerequisite: parse only the code from Kaggle, i.e. drop the markdown constraint
- Build a CNN model that finds graph vertices (tags) in the code
- Manually search the code parsed from Kaggle for specific data types
- Validate how well simple models (e.g. LogReg, SVM) perform on the tag classification task, using a combination of NL2ML and other corpora, e.g. CoNaLa/StackQC/etc.
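The simple-model baseline from the last task could look roughly like the sketch below. It assumes scikit-learn is installed and uses TF-IDF over code tokens as features, which is an illustrative choice and not necessarily the project's feature set. The snippets and tag names (data_import, model_train, visualization) are hand-written toy examples, not samples from NL2ML or any other corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy snippets and hypothetical tag names, for illustration only.
snippets = [
    "df = pd.read_csv('train.csv')",
    "data = pd.read_csv(path)",
    "test = pd.read_csv('test.csv')",
    "model.fit(X_train, y_train)",
    "clf.fit(X, y)",
    "model = clf.fit(X_train, y_train)",
    "plt.plot(x, y)",
    "plt.show()",
    "plt.hist(df['age'])",
]
tags = (
    ["data_import"] * 3
    + ["model_train"] * 3
    + ["visualization"] * 3
)


def make_baseline(kind="logreg"):
    """Build a TF-IDF + linear classifier pipeline ('logreg' or 'svm')."""
    clf = LogisticRegression(max_iter=1000) if kind == "logreg" else LinearSVC()
    # The token pattern keeps identifiers such as read_csv as single tokens.
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w*")
    return make_pipeline(vectorizer, clf)


logreg = make_baseline("logreg").fit(snippets, tags)
svm = make_baseline("svm").fit(snippets, tags)

# Classify unseen code chunks with both baselines.
unseen = ["frame = pd.read_csv('new.csv')", "model.fit(features, labels)"]
logreg_pred = list(logreg.predict(unseen))
svm_pred = list(svm.predict(unseen))
```

On a real corpus the same pipelines would be fit on a training split and scored on a held-out split. This sketch only classifies whole chunks; it does not cover tag segmentation.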
Datasets description
(a backup is stored at https://yadi.sk/d/qY9lEd6-275KEw)
Basically, you can download all the data and the models with dvc pull:
- Clone the repo into a folder
- Install DVC
- Open a terminal and go to the repo folder
- Run dvc pull, or dvc pull data if you want only the data without the models
- Enjoy!
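The steps above can also be scripted. The sketch below is a minimal wrapper that assumes git and DVC are already installed and on the PATH (DVC can be installed with pip install dvc, among other ways). The function names build_commands and fetch are hypothetical helpers, and the repository URL is left as a parameter because it is not given here.

```python
import subprocess
from pathlib import Path


def build_commands(repo_url, dest, data_only=False):
    """Return the git/dvc commands for the steps above without running them."""
    pull = ["dvc", "pull"]
    if data_only:
        # `dvc pull data` fetches only the data, without the models.
        pull.append("data")
    return [["git", "clone", repo_url, dest], pull]


def fetch(repo_url, dest, data_only=False):
    """Clone the repo into `dest`, then run dvc pull inside it."""
    clone, pull = build_commands(repo_url, dest, data_only)
    subprocess.run(clone, check=True)
    # dvc must run from inside the cloned repo folder.
    subprocess.run(pull, cwd=Path(dest), check=True)


# Example call (replace the placeholder with the actual repository URL):
# fetch("<repo-url>", "nl2ml", data_only=True)
```

Keeping command construction separate from execution makes it possible to inspect what will run before anything touches the network.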