{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Odyssey Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to Odyssey! Odyssey is a Python package that can analyze python library usage on GitHub through Google BigQuery. The purpose of this tutorial is to provide you a high-level idea of how to use it. Let's begin!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Work with GithubPython object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by introducing a central piece of Odyssey -- GithubPython object. This is the object that connects to Github data using BigQuery. It takes care of all the BigQuery connection, SQL query building, result polling, etc. for you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by creating a default GithubPython object. Because we didn't specify any package, the information we will get is about all data in the BigQuery Github database." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from odyssey.core.bigquery.GithubPython import GithubPython\n", "gp = GithubPython()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to see how many Python files in our BigQuery Github database." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may think: \"Wow that's way less than I expect. Does it mean that we only have ~5.9 million Python files on Github? The answer is no. The main reason is that Google BigQuery only has access to open-sourced repos on Github (those who has certain licences). Therefore, it is just a small subset of the whole Github.\n", "\n", "That's why, if you search for *.py file using Github web GUI, the number you will get won't be comparable to the number you get here." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5995653\n" ] } ], "source": [ "print(gp.get_count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's create another GithubPython object, but this time, specify that the package we are interested in is sklearn.\n", "\n", "Also, Odyssey allows you to exclude forks of the package, by explicitly providing a list of keywords that shouldn't appear in the repo name or file path. In this case, scikit-learn is the one we should avoid counting." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "gp_sklearn = GithubPython(package=\"sklearn\", exclude_forks=[\"scikit-learn\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's count then how many files that count \"sklearn\". **Caveat: Note that this is a simple string matching. So even if sklearn appears in comment or as a variable name, it will still count!**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "37262\n" ] } ], "source": [ "print(gp_sklearn.get_count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to see exactly what are those 37262 files that contain the word \"sklearn\", you can use get_all() to see all the entries. The return result is a list of BigQueryGithubEntry, a wrapper that provides nice utility function, such as get_url()." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = gp_sklearn.get_all()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "print(type(data[0]))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from odyssey.utils.output import pprint_ipynb" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
#!/usr/bin/env python\n",
       "# -*- coding: utf-8 -*-\n",
       "\n",
       """"Chainer example: Convolutional Neural Networks with SPP for Sentence Classification\n",
       "\n",
       "http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf\n",
       "https://arxiv.org/pdf/1406.4729v4.pdf\n",
       "\n",
       """"\n",
       "\n",
       "__version__ = '0.0.1'\n",
       "\n",
       "import sys\n",
       "\n",
       "reload(sys)\n",
       "sys.setdefaultencoding('utf-8')\n",
       "#print sys.getdefaultencoding()\n",
       "\n",
       "import re\n",
       "import logging\n",
       "logger = logging.getLogger(__name__)\n",
       "handler = logging.StreamHandler()\n",
       "logger.setLevel(logging.DEBUG)\n",
       "logger.addHandler(handler)\n",
       "\n",
       "import pprint\n",
       "def pp(obj):\n",
       "    pp = pprint.PrettyPrinter(indent=1, width=160)\n",
       "    str = pp.pformat(obj)\n",
       "    print re.sub(r"\\\\u([0-9a-f]{4})", lambda x: unichr(int("0x"+x.group(1),16)), str)\n",
       "\n",
       "import os, time, six\n",
       "start_time = time.time()\n",
       "\n",
       "import struct\n",
       "import numpy as np\n",
       "import cPickle as pickle\n",
       "import matplotlib.pyplot as plt\n",
       "import copy\n",
       "\n",
       "from chainer import cuda, Chain, Variable, optimizers, serializers, computational_graph\n",
       "import chainer.functions as F\n",
       "import chainer.links as L\n",
       "import chainer.optimizer\n",
       "\n",
       "xp = np\n",
       "BOS_TOKEN = '<s>'\n",
       "EOS_TOKEN = '</s>'\n",
       "UNK_TOKEN = '<unk>'\n",
       "PAD_TOKEN = '<pad>'\n",
       "\n",
       "\n",
       "def load_w2v_model(path):\n",
       "\n",
       "    # with open(path, 'rb') as f:\n",
       "    #     w2i = {}\n",
       "    #     i2w = {}\n",
       "    #\n",
       "    #     n_vocab, n_units = map(int, f.readline().split())\n",
       "    #     w = np.empty((n_vocab, n_units), dtype=np.float32)\n",
       "    #\n",
       "    #     for i in xrange(n_vocab):\n",
       "    #         word = ''\n",
       "    #         while True:\n",
       "    #             ch = f.read(1)\n",
       "    #             if ch == ' ': break\n",
       "    #             word += ch\n",
       "    #\n",
       "    #         try:\n",
       "    #             w2i[unicode(word)] = i\n",
       "    #             i2w[i] = unicode(word)\n",
       "    #\n",
       "    #         except RuntimeError:\n",
       "    #             logging.error('Error unicode(): %s', word)\n",
       "    #             w2i[word] = i\n",
       "    #             i2w[i] = word\n",
       "    #\n",
       "    #         w[i] = np.zeros(n_units)\n",
       "    #         for j in xrange(n_units):\n",
       "    #             w[i][j] = struct.unpack('f', f.read(struct.calcsize('f')))[0]\n",
       "    #\n",
       "    #         # ベクトルを正規化する\n",
       "    #         vlen = np.linalg.norm(w[i], 2)\n",
       "    #         w[i] /= vlen\n",
       "    #\n",
       "    #         # 改行を strip する\n",
       "    #         assert f.read(1) == '\\n'\n",
       "    # return w, w2i, i2w\n",
       "\n",
       "    from gensim.models import word2vec\n",
       "    return word2vec.Word2Vec.load_word2vec_format(path, binary=True)\n",
       "\n",
       "\n",
       "def load_data(path, w2v):\n",
       "    X_data, Y = [], []\n",
       "    labels = {}\n",
       "\n",
       "    X = []\n",
       "    max_len = 0\n",
       "\n",
       "    f = open(path, 'rU')\n",
       "    for i, line in enumerate(f):\n",
       "        # if i >= 10:\n",
       "        #     break\n",
       "\n",
       "        line = unicode(line).strip()\n",
       "        if line == u'':\n",
       "            continue\n",
       "\n",
       "        cols = line.split(u'\\t')\n",
       "        if len(cols) < 2:\n",
       "            sys.stderr.write('invalid record: {}\\n'.format(line))\n",
       "            continue\n",
       "\n",
       "        label = cols[0]\n",
       "        text  = cols[1]\n",
       "\n",
       "        tokens = text.split(' ')\n",
       "\n",
       "        vec = []\n",
       "        for token in tokens:\n",
       "            try:\n",
       "                vec.append(w2v[token])\n",
       "            except KeyError:\n",
       "                sys.stderr.write('unk: {}\\n'.format(token))\n",
       "                vec.append(w2v.seeded_vector(UNK_TOKEN))\n",
       "\n",
       "        if len(vec) > max_len:\n",
       "            max_len = len(vec)\n",
       "        X.append(vec)\n",
       "\n",
       "        if label not in labels:\n",
       "            labels[label] = len(labels)\n",
       "        Y.append(labels[label])\n",
       "\n",
       "    f.close()\n",
       "\n",
       "    for vec in X:\n",
       "        pad = [w2v.seeded_vector(PAD_TOKEN) for _ in range(max_len - len(vec))]\n",
       "        vec.extend(pad)\n",
       "\n",
       "    return X, Y, labels\n",
       "\n",
       "\n",
       "class MySPP(Chain):\n",
       "    def __init__(self, input_channel, output_channel, width, n_units, n_label):\n",
       "        super(MySPP, self).__init__(\n",
       "            conv1=L.Convolution2D(input_channel, output_channel, (3, width), pad=0),\n",
       "            conv2=L.Convolution2D(input_channel, output_channel, (4, width), pad=0),\n",
       "            conv3=L.Convolution2D(input_channel, output_channel, (5, width), pad=0),\n",
       "            fc4=L.Linear(output_channel * 3 * 3, n_units),\n",
       "            fc5=L.Linear(n_units, n_label)\n",
       "        )\n",
       "\n",
       "    def __call__(self, x, t, train=True):\n",
       "        y = self.forward(x, train=train)\n",
       "        return F.softmax_cross_entropy(y, t), F.accuracy(y, t)\n",
       "\n",
       "    def forward(self, x, train=True):\n",
       "        h1 = F.spatial_pyramid_pooling_2d(F.relu(self.conv1(x)), 2, F.MaxPooling2D)\n",
       "        h2 = F.spatial_pyramid_pooling_2d(F.relu(self.conv2(x)), 2, F.MaxPooling2D)\n",
       "        h3 = F.spatial_pyramid_pooling_2d(F.relu(self.conv3(x)), 2, F.MaxPooling2D)\n",
       "\n",
       "        # Convolution + Pooling を行った結果を結合する\n",
       "        concat = F.concat((h1, h2, h3), axis=1)\n",
       "\n",
       "        # 結合した結果に Dropout をかける\n",
       "        h4 = F.dropout(F.tanh(self.fc4(concat)), ratio=0.5, train=train)\n",
       "\n",
       "        # Dropout の結果を結合する\n",
       "        y = self.fc5(h4)\n",
       "\n",
       "        return y\n",
       "\n",
       "\n",
       "if __name__ == '__main__':\n",
       "\n",
       "    from argparse import ArgumentParser\n",
       "    parser = ArgumentParser(description='Chainer example: MySPP')\n",
       "    parser.add_argument('--train',           default='',  type=unicode, help='training file (.txt)')\n",
       "    parser.add_argument('--test',            default='',  type=unicode, help='evaluating file (.txt)')\n",
       "    parser.add_argument('--w2v',       '-w', default='',  type=unicode, help='word2vec model file (.bin)')\n",
       "    parser.add_argument('--gpu',       '-g', default=-1,  type=int, help='GPU ID (negative value indicates CPU)')\n",
       "    parser.add_argument('--epoch',     '-e', default=25,  type=int, help='number of epochs to learn')\n",
       "    parser.add_argument('--unit',      '-u', default=300, type=int, help='number of output channels')\n",
       "    parser.add_argument('--batchsize', '-b', default=100, type=int, help='learning batchsize size')\n",
       "    parser.add_argument('--output',    '-o', default='model-spp3-w2v',  type=str, help='output directory')\n",
       "    args = parser.parse_args()\n",
       "\n",
       "    if args.gpu >= 0:\n",
       "        cuda.check_cuda_available()\n",
       "        cuda.get_device(args.gpu).use()\n",
       "\n",
       "    xp = cuda.cupy if args.gpu >= 0 else np\n",
       "    # xp.random.seed(123)\n",
       "\n",
       "    # 学習の繰り返し回数\n",
       "    n_epoch = args.epoch\n",
       "\n",
       "    # 中間層の数\n",
       "    n_units = args.unit\n",
       "\n",
       "    # 確率的勾配降下法で学習させる際の1回分のバッチサイズ\n",
       "    batchsize = args.batchsize\n",
       "\n",
       "    model_dir = args.output\n",
       "    if not os.path.exists(model_dir):\n",
       "        os.mkdir(model_dir)\n",
       "\n",
       "    print('# loading word2vec model: {}'.format(args.w2v))\n",
       "    sys.stdout.flush()\n",
       "    model = load_w2v_model(args.w2v)\n",
       "    n_vocab = len(model.vocab)\n",
       "\n",
       "    # データの読み込み\n",
       "    X, y, labels = load_data(args.train, w2v=model)\n",
       "    X = xp.asarray(X, dtype=np.float32)\n",
       "    y = xp.asarray(y, dtype=np.int32)\n",
       "\n",
       "    n_sample = X.shape[0]\n",
       "    height   = X.shape[1]\n",
       "    width    = X.shape[2]\n",
       "    n_label = len(labels)\n",
       "\n",
       "    input_channel = 1\n",
       "    output_channel = 50\n",
       "\n",
       "    # (nsample, channel, height, width) の4次元テンソルに変換\n",
       "    X = X.reshape((n_sample, input_channel, height, width))\n",
       "\n",
       "    # トレーニングデータとテストデータに分割\n",
       "    from sklearn.model_selection import train_test_split\n",
       "    # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=123)\n",
       "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)\n",
       "    N = len(X_train)\n",
       "    N_test = len(X_test)\n",
       "\n",
       "    print('# gpu: {}'.format(args.gpu))\n",
       "    print('# embedding dim: {}, vocab {}'.format(width, n_vocab))\n",
       "    print('# epoch: {}'.format(n_epoch))\n",
       "    print('# batchsize: {}'.format(batchsize))\n",
       "    print('# input channel: {}'.format(1))\n",
       "    print('# output channel: {}'.format(n_units))\n",
       "    print('# train: {}, test: {}'.format(N, N_test))\n",
       "    print('# data height: {}, width: {}, labels: {}'.format(height, width, n_label))\n",
       "    sys.stdout.flush()\n",
       "\n",
       "    # Prepare CNN\n",
       "    model = MySPP(input_channel, output_channel, width, n_units, n_label)\n",
       "\n",
       "    if args.gpu >= 0:\n",
       "        model.to_gpu()\n",
       "\n",
       "    # 重み減衰\n",
       "    decay = 0.0001\n",
       "\n",
       "    # 勾配上限\n",
       "    grad_clip = 3\n",
       "\n",
       "    # Setup optimizer (Optimizer の設定)\n",
       "    # optimizer = optimizers.Adam()\n",
       "    optimizer = optimizers.AdaDelta()\n",
       "    optimizer.setup(model)\n",
       "    optimizer.add_hook(chainer.optimizer.GradientClipping(grad_clip))\n",
       "    optimizer.add_hook(chainer.optimizer.WeightDecay(decay))\n",
       "\n",
       "    # プロット用に実行結果を保存する\n",
       "    train_loss = []\n",
       "    train_norm = []\n",
       "    train_accuracy = []\n",
       "    test_loss = []\n",
       "    test_accuracy = []\n",
       "\n",
       "    start_at = time.time()\n",
       "    cur_at = start_at\n",
       "\n",
       "    # Learning loop\n",
       "    for epoch in six.moves.range(1, n_epoch + 1):\n",
       "\n",
       "        print('epoch {:} / {:}'.format(epoch, n_epoch))\n",
       "        sys.stdout.flush()\n",
       "\n",
       "        # sorted_gen = batch(sorted_parallel(X_train, y_train, N * batchsize), batchsize)\n",
       "        sum_train_loss = 0.\n",
       "        sum_train_accuracy = 0.\n",
       "        K = 0\n",
       "\n",
       "        # training\n",
       "        # N 個の順番をランダムに並び替える\n",
       "        perm = np.random.permutation(N)\n",
       "        for i in six.moves.range(0, N, batchsize):\n",
       "\n",
       "            x = Variable(X_train[perm[i:i + batchsize]], volatile='off')\n",
       "            t = Variable(y_train[perm[i:i + batchsize]], volatile='off')\n",
       "\n",
       "            # 勾配を初期化\n",
       "            model.cleargrads()\n",
       "\n",
       "            # 順伝播させて誤差と精度を算出\n",
       "            loss, accuracy = model(x, t, train=True)\n",
       "\n",
       "            sum_train_loss += float(loss.data) * len(t)\n",
       "            sum_train_accuracy += float(accuracy.data) * len(t)\n",
       "            K += len(t)\n",
       "\n",
       "            # 誤差逆伝播で勾配を計算\n",
       "            loss.backward()\n",
       "            optimizer.update()\n",
       "\n",
       "        train_loss.append(sum_train_loss / K)\n",
       "        train_accuracy.append(sum_train_accuracy / K)\n",
       "\n",
       "        # 訓練データの誤差と,正解精度を表示\n",
       "        now = time.time()\n",
       "        throuput = now - cur_at\n",
       "        norm = optimizer.compute_grads_norm()\n",
       "        print('train mean loss={:.6f}, accuracy={:.6f} ({:.6f} sec)'.format(sum_train_loss / K, sum_train_accuracy / K, throuput))\n",
       "        sys.stdout.flush()\n",
       "        cur_at = now\n",
       "\n",
       "        # evaluation\n",
       "        sum_test_loss = 0.\n",
       "        sum_test_accuracy = 0.\n",
       "        K = 0\n",
       "        for i in six.moves.range(0, N_test, batchsize):\n",
       "\n",
       "            x = Variable(X_test[i:i + batchsize], volatile='on')\n",
       "            t = Variable(y_test[i:i + batchsize], volatile='on')\n",
       "\n",
       "            # 順伝播させて誤差と精度を算出\n",
       "            loss, accuracy = model(x, t, train=False)\n",
       "\n",
       "            sum_test_loss += float(loss.data) * len(t)\n",
       "            sum_test_accuracy += float(accuracy.data) * len(t)\n",
       "            K += len(t)\n",
       "\n",
       "        test_loss.append(sum_test_loss / K)\n",
       "        test_accuracy.append(sum_test_accuracy / K)\n",
       "\n",
       "        # テストデータでの誤差と正解精度を表示\n",
       "        now = time.time()\n",
       "        throuput = now - cur_at\n",
       "        print(' test mean loss={:.6f}, accuracy={:.6f} ({:.6f} sec)'.format(sum_test_loss / K, sum_test_accuracy / K, throuput))\n",
       "        sys.stdout.flush()\n",
       "        cur_at = now\n",
       "\n",
       "        # model と optimizer を保存する\n",
       "        if args.gpu >= 0: model.to_cpu()\n",
       "        with open(os.path.join(model_dir, 'epoch_{:03d}.model'.format(epoch)), 'wb') as f:\n",
       "            pickle.dump(model, f)\n",
       "        if args.gpu >= 0: model.to_gpu()\n",
       "        with open(os.path.join(model_dir, 'epoch_{:03d}.state'.format(epoch)), 'wb') as f:\n",
       "            pickle.dump(optimizer, f)\n",
       "\n",
       "        # 精度と誤差をグラフ描画\n",
       "        if True:\n",
       "            ylim1 = [min(train_loss + test_loss), max(train_loss + test_loss)]\n",
       "            ylim2 = [min(train_accuracy + test_accuracy), max(train_accuracy + test_accuracy)]\n",
       "\n",
       "            # グラフ左\n",
       "            plt.figure(figsize=(10, 10))\n",
       "            plt.subplot(1, 2, 1)\n",
       "            plt.ylim(ylim1)\n",
       "            plt.plot(range(1, len(train_loss) + 1), train_loss, 'b')\n",
       "            plt.grid()\n",
       "            plt.ylabel('loss')\n",
       "            plt.legend(['train loss', 'train l2-norm'], loc="lower left")\n",
       "            plt.twinx()\n",
       "            plt.ylim(ylim2)\n",
       "            plt.plot(range(1, len(train_accuracy) + 1), train_accuracy, 'm')\n",
       "            plt.grid()\n",
       "            # plt.ylabel('accuracy')\n",
       "            plt.legend(['train accuracy'], loc="upper left")\n",
       "            plt.title('Loss and accuracy of training.')\n",
       "\n",
       "            # グラフ右\n",
       "            plt.subplot(1, 2, 2)\n",
       "            plt.ylim(ylim1)\n",
       "            plt.plot(range(1, len(test_loss) + 1), test_loss, 'b')\n",
       "            plt.grid()\n",
       "            # plt.ylabel('loss')\n",
       "            plt.legend(['test loss'], loc="lower left")\n",
       "            plt.twinx()\n",
       "            plt.ylim(ylim2)\n",
       "            plt.plot(range(1, len(test_accuracy) + 1), test_accuracy, 'm')\n",
       "            plt.grid()\n",
       "            plt.ylabel('accuracy')\n",
       "            plt.legend(['test accuracy'], loc="upper left")\n",
       "            plt.title('Loss and accuracy of test.')\n",
       "\n",
       "            plt.savefig('{}.png'.format(model_dir))\n",
       "            # plt.show()\n",
       "\n",
       "        cur_at = now\n",
       "\n",
       "    # model と optimizer を保存する\n",
       "    if args.gpu >= 0: model.to_cpu()\n",
       "    with open(os.path.join(model_dir, 'final.model'), 'wb') as f:\n",
       "        pickle.dump(model, f)\n",
       "    if args.gpu >= 0: model.to_gpu()\n",
       "    with open(os.path.join(model_dir, 'final.state'), 'wb') as f:\n",
       "        pickle.dump(optimizer, f)\n",
       "\n",
       "print('time spent:', time.time() - start_time)\n",
       "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pprint_ipynb(data[0])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'https://github.com/haradatm/nlp/tree/master/classify/train_spp3-w2v.py'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[0].get_url() # a link to Github file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Use filter to refine search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we are interested in searching for code snippet that contains usage of a specific class. In other cases, the criteria is a little bit more complicated, such as having \"X\" function and \"Y\" function in one file, or having \"Z\" alone. To support those need, filter is built. Let's now utilize its power to refine the search!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from odyssey.core.bigquery.filter import Contains, And, Or" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Let's define a filter that asks for either RandomForestClassifier or RandomForestRegressor\n", "rf_classifier_or_regressor = Or(Contains('RandomForestRegressor'),Contains('RandomForestClassifier'))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Then another filter that asks for occurence of SVC\n", "svc = Contains('SVC')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Connect the two using And\n", "# so we are interested in files that have both SVC and one of the two RandomForest models \n", "# (RandomForestClassifier or RandomForestRegressor) appearing at the same time.\n", "f = And(rf_classifier_or_regressor, svc)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rf_and_svc = gp_sklearn.get_all(f)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1070\n" ] } ], "source": [ "print(len(rf_and_svc))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
# coding: utf-8\n",
       "\n",
       "# ### Open using Jupyter Notebook. It holds the code and visualizations for developing the different classification algorithms (LibSVM, RBF SVM, Naive Bayes, Random Forest, Gradient Boosting) on the chosen subset of important features. \n",
       "\n",
       "# In[27]:\n",
       "\n",
       "import pandas as pd\n",
       "import numpy as np\n",
       "from numpy import sort\n",
       "from sklearn.metrics import matthews_corrcoef, accuracy_score,confusion_matrix\n",
       "from sklearn.feature_selection import SelectFromModel\n",
       "from matplotlib import pyplot\n",
       "import pylab as pl\n",
       "from sklearn import svm\n",
       "\n",
       "get_ipython().magic(u'matplotlib inline')\n",
       "\n",
       "\n",
       "# In[4]:\n",
       "\n",
       "SEED = 1234\n",
       "## Selected set of most important features\n",
       "\n",
       "featureSet=['L3_S31_F3846','L1_S24_F1578','L3_S33_F3857','L1_S24_F1406','L3_S29_F3348','L3_S33_F3863',\n",
       "            'L3_S29_F3427','L3_S37_F3950','L0_S9_F170', 'L3_S29_F3321','L1_S24_F1346','L3_S32_F3850',\n",
       "            'L3_S30_F3514','L1_S24_F1366','L2_S26_F3036']\n",
       "\n",
       "train_x = pd.read_csv("../data/train_numeric.csv", usecols=featureSet)\n",
       "train_y = pd.read_csv("../data/train_numeric.csv", usecols=['Response'])\n",
       "\n",
       "\n",
       "# In[5]:\n",
       "\n",
       "test_x = pd.read_csv("../data/test_numeric.csv", usecols=featureSet)\n",
       "\n",
       "\n",
       "# In[6]:\n",
       "\n",
       "train_x = train_x.fillna(9999999)\n",
       "msk = np.random.rand(len(train_x)) < 0.7  # creating Training and validation set \n",
       "\n",
       "\n",
       "X_train = train_x[msk]\n",
       "\n",
       "Y_train = train_y.Response.ravel()[msk]\n",
       "\n",
       "X_valid = train_x[~msk]\n",
       "Y_valid = train_y.Response.ravel()[~msk]\n",
       "\n",
       "\n",
       "# In[7]:\n",
       "\n",
       "def showconfusionmatrix(cm, typeModel):\n",
       "    pl.matshow(cm)\n",
       "    pl.title('Confusion matrix for '+typeModel)\n",
       "    pl.colorbar()\n",
       "    pl.show()\n",
       "\n",
       "\n",
       "# In[24]:\n",
       "\n",
       "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
       "\n",
       "C=4\n",
       "lin_svc = svm.LinearSVC(C=C).fit(X_train, Y_train)\n",
       "print "LibSVM fitted"\n",
       "\n",
       "title = 'LinearSVC (linear kernel)'\n",
       "\n",
       "predicted = lin_svc.predict(X_valid)\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+title+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, title)\n",
       "print "Confusion Matrix"\n",
       "print (cm)\n",
       "\n",
       "\n",
       "# In[22]:\n",
       "\n",
       "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
       "\n",
       "C=4\n",
       "rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_train, Y_train)\n",
       "print "RBF fitted"\n",
       "\n",
       "\n",
       "title = 'SVC with RBF kernel'\n",
       "\n",
       "predicted = rbf_svc.predict(X_valid)\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+title+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, title)\n",
       "print "Confusion Matrix"\n",
       "print (cm)\n",
       "\n",
       "\n",
       "# In[10]:\n",
       "\n",
       "\n",
       "from sklearn.naive_bayes import GaussianNB\n",
       "\n",
       "gnb = GaussianNB()\n",
       "\n",
       "clf = gnb.fit(X_train,Y_train)\n",
       "print "Naive Bayes Fitted"\n",
       "\n",
       "\n",
       "title = 'Naive Bayes'\n",
       "\n",
       "predicted = clf.predict(X_valid)\n",
       "\n",
       "\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+title+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, title)\n",
       "print "Confusion Matrix"\n",
       "print (cm)\n",
       "\n",
       "\n",
       "# In[21]:\n",
       "\n",
       "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
       "from sklearn.cross_validation import cross_val_score\n",
       "from sklearn.model_selection import GridSearchCV\n",
       "\n",
       "\n",
       "# In[23]:\n",
       "\n",
       "rf = RandomForestClassifier(n_estimators=20, n_jobs=2)\n",
       "param_grid = {\n",
       "                 'n_estimators': [5, 10, 15, 20],\n",
       "                 'max_depth': [2, 5, 7, 9]\n",
       "             }\n",
       "\n",
       "\n",
       "# In[24]:\n",
       "\n",
       "grid_rf = GridSearchCV(rf, param_grid, cv=10)\n",
       "rf_model=grid_rf.fit(X_train, Y_train)\n",
       "\n",
       "\n",
       "# In[30]:\n",
       "\n",
       "print "RF fitted"\n",
       "\n",
       "titles = 'Random Forest'\n",
       "\n",
       "predicted = rf_model.predict(X_valid)\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+titles[0]+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, titles[0])\n",
       "\n",
       "\n",
       "# In[31]:\n",
       "\n",
       "gb = GradientBoostingClassifier(learning_rate=0.5)\n",
       "param_grid = {\n",
       "                 'n_estimators': [5, 10, 15, 20],\n",
       "                 'max_depth': [2, 5, 7, 9]\n",
       "             }\n",
       "\n",
       "\n",
       "# In[32]:\n",
       "\n",
       "grid_gb = GridSearchCV(gb, param_grid, cv=10)\n",
       "gb_model=grid_gb.fit(X_train, Y_train)\n",
       "\n",
       "\n",
       "# In[36]:\n",
       "\n",
       "print "GB fitted"\n",
       "\n",
       "title = 'Gradient Boosting'\n",
       "\n",
       "predicted = gb_model.predict(X_valid)\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+title+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, title)\n",
       "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Verify the occurence. We indeed have both!\n", "pprint_ipynb(rf_and_svc[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Repos with top imports" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One common question Python library writers (or even users) are interested in is: who is using this library? Odyssey supports querying repos with top imports of your package-in-interest. In one line, you can get the answer!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note: The first time running this will be very slow!**" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1000\n", "2000\n", "3000\n", "4000\n", "5000\n", "6000\n", "7000\n", "8000\n", "9000\n", "10000\n", "11000\n", "12000\n", "13000\n", "14000\n", "15000\n", "16000\n", "17000\n", "18000\n", "19000\n", "20000\n", "21000\n", "22000\n", "23000\n", "24000\n", "25000\n", "26000\n", "27000\n", "28000\n", "29000\n", "30000\n", "31000\n", "32000\n", "33000\n", "34000\n", "35000\n", "36000\n", "37000\n" ] } ], "source": [ "top20_imports = gp_sklearn.get_top_import_repo(n=20) # top imports by file count" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('ngoix/OCRF', 291), ('automl/auto-sklearn', 195), ('hmendozap/auto-sklearn', 186), ('florian-f/sklearn', 146), ('seckcoder/lang-learn', 141), ('GbalsaC/bitnamiP', 119), ('automl/paramsklearn', 100), ('chaluemwut/fbserver', 99), ('magic2du/contact_matrix', 96), ('nok/sklearn-porter', 95), ('jpzk/evopy', 87), ('B3AU/waveTree', 77), ('sinhrks/expandas', 64), ('chkoar/imbalanced-learn', 61), ('liyu1990/sklearn', 61), ('KennyCandy/HAR', 54), ('sinhrks/pandas-ml', 54), ('RecipeML/Recipe', 52), ('dvro/imbalanced-learn', 51), ('Tjorriemorrie/trading', 51)]\n" ] } ], "source": [ "print(top20_imports)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20\n" ] } ], "source": [ "# Verify that the the count matches\n", "print(len(top20_imports)) # 20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Most imported class/submodule/funcion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common question is how often a certain class/submodule/function is imported. Odyssey can answer that too." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "top20_models = gp_sklearn.get_most_imported_class(n=20)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('RandomForestClassifier', 2534), ('LogisticRegression', 2152), ('SVC', 1998), ('StandardScaler', 1783), ('PCA', 1732), ('Pipeline', 1519), ('GridSearchCV', 1511), ('KMeans', 1451), ('TfidfVectorizer', 1314), ('CountVectorizer', 1294), ('KNeighborsClassifier', 1188), ('LinearSVC', 1116), ('DecisionTreeClassifier', 1047), ('LinearRegression', 861), ('GaussianNB', 817), ('LabelEncoder', 728), ('MultinomialNB', 723), ('RandomForestRegressor', 681), ('AdaBoostClassifier', 673), ('SGDClassifier', 642)]\n" ] } ], "source": [ "print(top20_models)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See what are the entries by calling get_import_source() " ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sources = gp_sklearn.get_import_source(\"RandomForestClassifier\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
# coding: utf-8\n",
       "\n",
       "# ### Open using Jupyter Notebook. It holds the code and visualizations for developing the different classification algorithms (LibSVM, RBF SVM, Naive Bayes, Random Forest, Gradient Boosting) on the chosen subset of important features. \n",
       "\n",
       "# In[27]:\n",
       "\n",
       "import pandas as pd\n",
       "import numpy as np\n",
       "from numpy import sort\n",
       "from sklearn.metrics import matthews_corrcoef, accuracy_score,confusion_matrix\n",
       "from sklearn.feature_selection import SelectFromModel\n",
       "from matplotlib import pyplot\n",
       "import pylab as pl\n",
       "from sklearn import svm\n",
       "\n",
       "get_ipython().magic(u'matplotlib inline')\n",
       "\n",
       "\n",
       "# In[4]:\n",
       "\n",
       "SEED = 1234\n",
       "## Selected set of most important features\n",
       "\n",
       "featureSet=['L3_S31_F3846','L1_S24_F1578','L3_S33_F3857','L1_S24_F1406','L3_S29_F3348','L3_S33_F3863',\n",
       "            'L3_S29_F3427','L3_S37_F3950','L0_S9_F170', 'L3_S29_F3321','L1_S24_F1346','L3_S32_F3850',\n",
       "            'L3_S30_F3514','L1_S24_F1366','L2_S26_F3036']\n",
       "\n",
       "train_x = pd.read_csv("../data/train_numeric.csv", usecols=featureSet)\n",
       "train_y = pd.read_csv("../data/train_numeric.csv", usecols=['Response'])\n",
       "\n",
       "\n",
       "# In[5]:\n",
       "\n",
       "test_x = pd.read_csv("../data/test_numeric.csv", usecols=featureSet)\n",
       "\n",
       "\n",
       "# In[6]:\n",
       "\n",
       "train_x = train_x.fillna(9999999)\n",
       "msk = np.random.rand(len(train_x)) < 0.7  # creating Training and validation set \n",
       "\n",
       "\n",
       "X_train = train_x[msk]\n",
       "\n",
       "Y_train = train_y.Response.ravel()[msk]\n",
       "\n",
       "X_valid = train_x[~msk]\n",
       "Y_valid = train_y.Response.ravel()[~msk]\n",
       "\n",
       "\n",
       "# In[7]:\n",
       "\n",
       "def showconfusionmatrix(cm, typeModel):\n",
       "    pl.matshow(cm)\n",
       "    pl.title('Confusion matrix for '+typeModel)\n",
       "    pl.colorbar()\n",
       "    pl.show()\n",
       "\n",
       "\n",
       "# In[24]:\n",
       "\n",
       "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
       "\n",
       "C=4\n",
       "lin_svc = svm.LinearSVC(C=C).fit(X_train, Y_train)\n",
       "print "LibSVM fitted"\n",
       "\n",
       "title = 'LinearSVC (linear kernel)'\n",
       "\n",
       "predicted = lin_svc.predict(X_valid)\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+title+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, title)\n",
       "print "Confusion Matrix"\n",
       "print (cm)\n",
       "\n",
       "\n",
       "# In[22]:\n",
       "\n",
       "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
       "\n",
       "C=4\n",
       "rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_train, Y_train)\n",
       "print "RBF fitted"\n",
       "\n",
       "\n",
       "title = 'SVC with RBF kernel'\n",
       "\n",
       "predicted = rbf_svc.predict(X_valid)\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+title+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, title)\n",
       "print "Confusion Matrix"\n",
       "print (cm)\n",
       "\n",
       "\n",
       "# In[10]:\n",
       "\n",
       "\n",
       "from sklearn.naive_bayes import GaussianNB\n",
       "\n",
       "gnb = GaussianNB()\n",
       "\n",
       "clf = gnb.fit(X_train,Y_train)\n",
       "print "Naive Bayes Fitted"\n",
       "\n",
       "\n",
       "title = 'Naive Bayes'\n",
       "\n",
       "predicted = clf.predict(X_valid)\n",
       "\n",
       "\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+title+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, title)\n",
       "print "Confusion Matrix"\n",
       "print (cm)\n",
       "\n",
       "\n",
       "# In[21]:\n",
       "\n",
       "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
       "from sklearn.cross_validation import cross_val_score\n",
       "from sklearn.model_selection import GridSearchCV\n",
       "\n",
       "\n",
       "# In[23]:\n",
       "\n",
       "rf = RandomForestClassifier(n_estimators=20, n_jobs=2)\n",
       "param_grid = {\n",
       "                 'n_estimators': [5, 10, 15, 20],\n",
       "                 'max_depth': [2, 5, 7, 9]\n",
       "             }\n",
       "\n",
       "\n",
       "# In[24]:\n",
       "\n",
       "grid_rf = GridSearchCV(rf, param_grid, cv=10)\n",
       "rf_model=grid_rf.fit(X_train, Y_train)\n",
       "\n",
       "\n",
       "# In[30]:\n",
       "\n",
       "print "RF fitted"\n",
       "\n",
       "titles = 'Random Forest'\n",
       "\n",
       "predicted = rf_model.predict(X_valid)\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+titles[0]+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, titles[0])\n",
       "\n",
       "\n",
       "# In[31]:\n",
       "\n",
       "gb = GradientBoostingClassifier(learning_rate=0.5)\n",
       "param_grid = {\n",
       "                 'n_estimators': [5, 10, 15, 20],\n",
       "                 'max_depth': [2, 5, 7, 9]\n",
       "             }\n",
       "\n",
       "\n",
       "# In[32]:\n",
       "\n",
       "grid_gb = GridSearchCV(gb, param_grid, cv=10)\n",
       "gb_model=grid_gb.fit(X_train, Y_train)\n",
       "\n",
       "\n",
       "# In[36]:\n",
       "\n",
       "print "GB fitted"\n",
       "\n",
       "title = 'Gradient Boosting'\n",
       "\n",
       "predicted = gb_model.predict(X_valid)\n",
       "mcc= matthews_corrcoef(Y_valid, predicted)\n",
       "print "MCC Score \\t +"+title+str(mcc)\n",
       "\n",
       "cm = confusion_matrix(predicted, Y_valid)\n",
       "showconfusionmatrix(cm, title)\n",
       "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pprint_ipynb(sources[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5: Instantiation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For classes, Odyssey can provide you with insights about how they are instantiated, default argument value people use, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note: All the arguments in the returned dictionary are in string format (even for integer values). This may be changed later.**" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rfc_instantiation = gp_sklearn.get_instantiation(\"RandomForestClassifier\")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "defaultdict(.>,\n", " {'*': defaultdict(int, {None: 4}),\n", " '**': defaultdict(int, {None: 24}),\n", " '**args': defaultdict(int, {None: 2}),\n", " '**classi_params': defaultdict(int, {None: 1}),\n", " '**classif_base.get_params()': defaultdict(int, {None: 1}),\n", " '**classifier_pram_dic[rf_name]': defaultdict(int, {None: 1}),\n", " \"**clf.get('config')\": defaultdict(int, {None: 2}),\n", " '**clf_args_': defaultdict(int, {None: 2}),\n", " '**clf_params': defaultdict(int, {None: 1}),\n", " '**cls_kwargs': defaultdict(int, {None: 1}),\n", " '**config_clf': defaultdict(int, {None: 1}),\n", " '**estimator_params': defaultdict(int, {None: 4}),\n", " '**forest_parms': defaultdict(int, {None: 2}),\n", " '**gs.best_params_': defaultdict(int, {None: 1}),\n", " '**job': defaultdict(int, {None: 1}),\n", " '**kwargs': defaultdict(int, {None: 10}),\n", " '**model_params': defaultdict(int, {None: 3}),\n", " '**params': defaultdict(int, {None: 13}),\n", " '**params_used': defaultdict(int, {None: 1}),\n", " '**parse.config_for_function(RandomForestClassifier.__init__, config)': defaultdict(int,\n", " {None: 2}),\n", " '**rf_config': defaultdict(int, {None: 1}),\n", " '**rf_parameters': defaultdict(int, {None: 3}),\n", " '**rf_params': defaultdict(int, {None: 1}),\n", " '**self.param_dict': defaultdict(int, {None: 1}),\n", " '**self.params': defaultdict(int, {None: 4}),\n", " \"**{'n_estimators' : 7500, 'max_depth' : 200}\": defaultdict(int,\n", " {None: 2}),\n", " 'bootstrap': defaultdict(int,\n", " {'False': 26,\n", " 'True': 62,\n", " 'bootstrap': 2,\n", " 'bs': 1,\n", " 'bstp': 1,\n", " 'config[\"rf:bootstrap\"]': 1,\n", " 'context[\"classifiers\"][classifier_name][\"learning_algorithm\"][\"parameters\"][\"bootstrap\"]': 1,\n", " 'estimator.best_estimator_.bootstrap': 2,\n", " 'p2': 1,\n", " 'param_bootstrap': 2,\n", " 'self.bootstrap': 4,\n", " 'self.bootstrap_forest': 1,\n", " 'settings[\"bootstrap\"]': 1}),\n", " 'class_weight': defaultdict(int,\n", " {' {0: 1, 1:10}': 3,\n", " ' {0:0.098, \\n 1:0.111, \\n 2:0.104, \\n 3:0.102, \\n 4:0.098, \\n 5:0.088, \\n 6:0.095, \\n 7:0.103, \\n 8:0.098, \\n 9:0.102}': 2,\n", " '\"auto\"': 3,\n", " '\"balanced\"': 35,\n", " '\"balanced_subsample\"': 1,\n", " \"'auto'\": 16,\n", " \"'balanced'\": 26,\n", " \"'balanced_subsample'\": 5,\n", " \"'subsample'\": 1,\n", " 'None': 21,\n", " 'class_weight': 5,\n", " 'class_wt': 1,\n", " 'cw': 1,\n", " 'param_class_weight': 2,\n", " \"params['class_weight']\": 1,\n", " 'rf_weights': 1,\n", " 'self.class_weight': 5,\n", " 'self.class_weight_forest': 1,\n", " 'weight': 2,\n", " '{0: 1, 1: 28}': 2,\n", " \"{0: 1, 1: 
space['cw']}\": 1,\n", " \"{0:1, 1: space['cw']}\": 1,\n", " '{0:100, 1:1}': 1,\n", " '{1:weights*np.count_nonzero(Y)/len(Y),0:1-(np.count_nonzero(Y)/len(Y))}': 1,\n", " '{False:1, True:1}': 1}),\n", " 'compute_importances': defaultdict(int,\n", " {'False': 2, 'None': 1, 'True': 25}),\n", " 'criterion': defaultdict(int,\n", " {'\"entropy\"': 37,\n", " '\"gini\"': 9,\n", " \"'entropy'\": 66,\n", " \"'gini'\": 81,\n", " 'CRIT': 1,\n", " 'args[1]': 1,\n", " 'c': 5,\n", " 'config[\"rf:criterion\"]': 1,\n", " 'context[\"classifiers\"][classifier_name][\"learning_algorithm\"][\"parameters\"][\"criterion\"]': 1,\n", " 'crit': 1,\n", " 'crit_out': 6,\n", " 'criterion': 13,\n", " \"criterion[best['criterion']]\": 2,\n", " 'criterion_t': 1,\n", " 'estimator.best_estimator_.criterion': 2,\n", " 'feature': 1,\n", " 'p1': 1,\n", " 'param_criterion': 2,\n", " \"params['criterion']\": 1,\n", " 'rf_criterion': 3,\n", " 'self.criterion': 6,\n", " 'self.criterion_forest': 1,\n", " 'settings[\"criterion\"]': 1,\n", " 'splitcriteria_param': 1}),\n", " 'featuresCol': defaultdict(int, {'\"features\"': 1}),\n", " 'labelCol': defaultdict(int, {'\"Response\"': 1}),\n", " 'maxDepth': defaultdict(int, {'15': 1}),\n", " 'max_depth': defaultdict(int,\n", " {'1': 6,\n", " '10': 36,\n", " '100': 10,\n", " '13': 5,\n", " '14': 1,\n", " '15': 8,\n", " '16': 10,\n", " '17': 3,\n", " '2': 11,\n", " '20': 4,\n", " '2000': 1,\n", " '22': 1,\n", " '25': 5,\n", " '3': 11,\n", " '30': 2,\n", " '4': 11,\n", " '40': 3,\n", " '5': 57,\n", " '50': 9,\n", " '52': 14,\n", " '6': 5,\n", " '60': 2,\n", " '600': 1,\n", " '7': 6,\n", " '700': 1,\n", " '8': 7,\n", " '80': 2,\n", " '9': 2,\n", " 'C': 2,\n", " 'None': 126,\n", " 'RFC_depth': 6,\n", " 'RF_depth': 1,\n", " 'TREE_DEPTH': 2,\n", " '_max_depth': 1,\n", " 'args[\"max_tree_nodes\"]': 1,\n", " 'args[2]': 1,\n", " 'best_m': 1,\n", " \"best_pars['max_depth']\": 1,\n", " 'config[\"rf:max_depth\"]': 1,\n", " 'context[\"classifiers\"][classifier_name][\"learning_algorithm\"][\"parameters\"][\"max_depth\"]': 1,\n", " 'depth': 7,\n", " 'depth_out': 4,\n", " 'estimator.best_estimator_.max_depth': 2,\n", " 'feature': 1,\n", " \"grid_search.best_params_['max_depth']\": 1,\n", " 'hyper_parameter': 2,\n", " 'length': 1,\n", " 'm': 1,\n", " 'm_d': 2,\n", " 'm_dep': 3,\n", " 'maxDepth[0]': 1,\n", " 'max_D': 1,\n", " 'max_d': 2,\n", " 'max_dep': 9,\n", " 'max_depth': 24,\n", " 'max_depth_option': 1,\n", " 'max_tree_depth': 1,\n", " 'md': 2,\n", " 'p4': 1,\n", " 'param_max_depth': 2,\n", " \"params['max_depth']\": 1,\n", " \"paras['rf'][0]\": 3,\n", " 'rf_max_depth': 3,\n", " \"self._settings.get('max_depth', 10)\": 1,\n", " 'self.k': 1,\n", " 'self.max_depth': 6,\n", " 'self.max_depth_forest': 1,\n", " \"space['max_depth']\": 3}),\n", " 'max_features': defaultdict(int,\n", " {' int(math.sqrt(features))': 1,\n", " \" params['max_features']\": 1,\n", " '\"auto\"': 17,\n", " '\"log2\"': 7,\n", " '\"sqrt\"': 25,\n", " \"'auto'\": 64,\n", " \"'log2'\": 7,\n", " \"'sqrt'\": 17,\n", " '.33': 1,\n", " '0.1': 2,\n", " '0.2': 1,\n", " '0.4': 3,\n", " '0.497907908371': 1,\n", " '0.5': 2,\n", " '0.59': 1,\n", " '0.6': 1,\n", " '0.7': 1,\n", " '0.8': 1,\n", " '1': 59,\n", " '1.': 1,\n", " '1.0/3': 1,\n", " '10': 6,\n", " '100': 2,\n", " '128': 1,\n", " '15': 1,\n", " '16': 1,\n", " '2': 3,\n", " '20': 2,\n", " '200': 1,\n", " '3': 5,\n", " '30': 1,\n", " '375': 1,\n", " '38': 1,\n", " '4': 4,\n", " '5': 9,\n", " '50': 1,\n", " '500': 2,\n", " '7': 3,\n", " '8': 1,\n", " '80': 1,\n", " 'None': 43,\n", " 'R': 
1,\n", " 'SILLY_NUMBER': 1,\n", " \"best_params[dataset_name][method_name]['rf_max_features']\": 1,\n", " \"best_params[dataset_name][method_name][nr_events]['rf_max_features']\": 2,\n", " \"best_pars['max_features']\": 1,\n", " 'c_max_features': 1,\n", " 'config[\\n \"rf:max_features\"]': 1,\n", " 'context[\"classifiers\"][classifier_name][\"learning_algorithm\"][\"parameters\"][\"max_features\"]': 1,\n", " 'feature': 4,\n", " 'features': 3,\n", " \"grid_search.best_params_['max_features']\": 1,\n", " 'individual[2]': 1,\n", " 'int(math.sqrt(n_features))': 2,\n", " 'int(mtry)': 1,\n", " 'int(np.sqrt(len(self.dataframe.columns)))': 1,\n", " 'k': 2,\n", " 'm_f': 2,\n", " 'm_feat': 3,\n", " 'max_f': 3,\n", " 'max_feat_out': 5,\n", " 'max_feature': 1,\n", " 'max_features': 28,\n", " 'max_features_options': 1,\n", " 'mf': 4,\n", " 'min(49, len(result1.columns) - 1)': 1,\n", " 'min(52, len(result1.columns) - 1)': 1,\n", " 'min(64, len(result2.columns) - 1)': 1,\n", " 'mtry': 2,\n", " 'n_feat': 2,\n", " 'n_features': 1,\n", " 'p5': 1,\n", " 'param_max_features': 2,\n", " \"params['max_features']\": 1,\n", " \"paras['rf'][2]\": 3,\n", " 'rf_max_features': 7,\n", " 'rf_no_active_vars': 3,\n", " 'self.__max_features': 2,\n", " 'self.max_features': 3,\n", " 'self.max_features_forest': 1,\n", " 'settings[\"max_features\"]': 1,\n", " \"space['max_features']\": 3,\n", " 'total_features': 2,\n", " 'tree_features': 2,\n", " 'tunings[1] / 100': 8}),\n", " 'max_leaf_nodes': defaultdict(int,\n", " {'1000': 3,\n", " '365': 14,\n", " '50': 2,\n", " 'None': 26,\n", " 'feature': 1,\n", " 'int(tunings[4])': 1,\n", " 'max_leaf_nodes_options': 1,\n", " 'mln': 1,\n", " 'node_out': 1,\n", " 'param_max_leaf_nodes': 1,\n", " \"params['max_leaf_nodes']\": 1,\n", " 'self.max_leaf_nodes': 3,\n", " 'self.max_leaf_nodes_forest': 1}),\n", " 'minInstances': defaultdict(int, {'10': 1}),\n", " 'min_impurity_split': defaultdict(int,\n", " {'0.1': 1, '1e-07': 7, '1e-7': 1}),\n", " 'min_samples_leaf': defaultdict(int,\n", " {'1': 31,\n", " '1.0': 1,\n", " '10': 6,\n", " '100': 1,\n", " '1000': 1,\n", " '15': 1,\n", " '150': 1,\n", " '2': 23,\n", " '20': 8,\n", " '200': 2,\n", " '3': 6,\n", " '365': 14,\n", " '4': 3,\n", " '5': 13,\n", " '6': 1,\n", " '8': 10,\n", " '9': 1,\n", " 'args[\"min_samples_leaf\"]': 1,\n", " 'best_param': 1,\n", " \"best_pars['msl']\": 1,\n", " 'config[\\n \"rf:min_samples_leaf\"]': 1,\n", " 'context[\"classifiers\"][classifier_name][\"learning_algorithm\"][\"parameters\"][\\n \"min_samples_leaf\"]': 1,\n", " 'individual[1]': 1,\n", " 'int(np.round(x[i]))': 2,\n", " 'int(settings[\"min_sample_leaf\"])': 1,\n", " 'int(tunings[2])': 8,\n", " 'leaf_size': 2,\n", " 'm_s_l': 2,\n", " 'm_sam_leaf': 3,\n", " 'min_samples_at_leaf': 1,\n", " 'min_samples_leaf': 10,\n", " 'min_samples_leaf_options': 1,\n", " 'msl': 4,\n", " 'n': 8,\n", " 'nodes': 6,\n", " 'p7': 1,\n", " 'param': 1,\n", " 'param_min_samples_leaf': 2,\n", " \"params['min_samples_leaf']\": 1,\n", " 'self.min_samples_leaf': 6,\n", " 'self.min_samples_leaf_forest': 1,\n", " \"space['msl']\": 3,\n", " 'val': 1}),\n", " 'min_samples_split': defaultdict(int,\n", " {'0.02': 1,\n", " '1': 80,\n", " '10': 14,\n", " '100': 10,\n", " '1000': 1,\n", " '12': 4,\n", " '13': 2,\n", " '15': 2,\n", " '16': 3,\n", " '163': 1,\n", " '17': 5,\n", " '2': 67,\n", " '2*min_samples_at_leaf': 1,\n", " '20': 1,\n", " '25': 1,\n", " '256': 1,\n", " '3': 2,\n", " '30': 1,\n", " '32': 1,\n", " '4': 15,\n", " '5': 7,\n", " '50': 5,\n", " '7': 1,\n", " '70': 2,\n", " '76': 
14,\n", " '8': 4,\n", " '9': 4,\n", " 'args[\"min_samples_split\"]': 1,\n", " \"best_pars['mss']\": 1,\n", " 'config[\\n \"rf:min_samples_split\"]': 1,\n", " 'context[\"classifiers\"][classifier_name][\"learning_algorithm\"][\"parameters\"][\\n \"min_samples_split\"]': 1,\n", " 'feature': 1,\n", " 'individual[0]': 1,\n", " 'int(settings[\"min_sample_split\"])': 1,\n", " 'int(tunings[3])': 8,\n", " 'len(self.x) / 8': 1,\n", " 'm_s_s': 2,\n", " 'min_sample': 1,\n", " 'min_samples': 4,\n", " 'min_samples_spl': 1,\n", " 'min_samples_split': 6,\n", " 'nodes*2': 6,\n", " 'p6': 1,\n", " 'param_min_samples_split': 2,\n", " \"params['min_samples_split']\": 1,\n", " 'rf_min_sample_count': 3,\n", " 'sample_out': 3,\n", " 'self.min_samples_split': 4,\n", " 'self.min_samples_split_forest': 1,\n", " \"space['mss']\": 3}),\n", " 'min_weight_fraction_leaf': defaultdict(int,\n", " {'0': 6,\n", " '0.0': 15,\n", " '0.1': 1,\n", " '0.5': 1,\n", " 'feature': 1,\n", " 'frac_out': 2,\n", " 'int(settings[\"min_weight_faction_leaf\"])': 1,\n", " 'min_weight_fraction_leaf': 2,\n", " 'mwfl': 1,\n", " 'param_min_weight_fraction_leaf': 1,\n", " 'self.min_weight_fraction_leaf': 3,\n", " 'self.min_weight_fraction_leaf_forest': 1}),\n", " 'n_estimators': defaultdict(int,\n", " {' n_estimators/2': 1,\n", " \" params['n_estimators']\": 1,\n", " ' pm.num_trees': 2,\n", " ' self.RF_size': 1,\n", " ' self.n_estimators': 1,\n", " ' self.n_trees': 2,\n", " ' self.ntrees': 1,\n", " '0': 3,\n", " '1': 20,\n", " '10': 206,\n", " '100': 435,\n", " '1000': 69,\n", " '10000': 9,\n", " '101': 2,\n", " '1024': 5,\n", " '11': 3,\n", " '12': 2,\n", " '120': 4,\n", " '1200': 4,\n", " '12000': 1,\n", " '128': 2,\n", " '13': 2,\n", " '1400': 2,\n", " '15': 11,\n", " '150': 27,\n", " '1500': 3,\n", " '15000': 1,\n", " '17': 1,\n", " '18': 2,\n", " '180': 1,\n", " '196': 1,\n", " '198': 14,\n", " '1999': 1,\n", " '2': 3,\n", " '20': 44,\n", " '20*8': 1,\n", " '200': 48,\n", " '2000': 17,\n", " '22': 1,\n", " '23': 1,\n", " '240': 2,\n", " '25': 32,\n", " '250': 10,\n", " '2500': 3,\n", " '256': 6,\n", " '3': 3,\n", " '30': 32,\n", " '300': 38,\n", " '3000': 6,\n", " '30000': 1,\n", " '32': 3,\n", " '34': 1,\n", " '35': 3,\n", " '350': 1,\n", " '4': 2,\n", " '40': 20,\n", " '400': 7,\n", " '48': 1,\n", " '5': 10,\n", " '50': 97,\n", " '500': 122,\n", " '5000': 7,\n", " '51': 3,\n", " '512': 4,\n", " '52': 2,\n", " '55': 1,\n", " '550': 1,\n", " '6': 1,\n", " '60': 4,\n", " '600': 1,\n", " '625': 1,\n", " '64': 3,\n", " '65': 1,\n", " '7': 1,\n", " '700': 2,\n", " '75': 2,\n", " '750': 3,\n", " '8': 1,\n", " '80': 6,\n", " '800': 4,\n", " '8000': 1,\n", " '84': 1,\n", " '850': 2,\n", " '9': 1,\n", " '90': 2,\n", " '900': 1,\n", " '91': 1,\n", " '94': 3,\n", " '95': 1,\n", " '99': 3,\n", " 'C': 1,\n", " 'NEST': 1,\n", " 'R': 1,\n", " 'RFC_estimators': 6,\n", " 'RF_estimators': 1,\n", " 'RF_size': 3,\n", " 'args.ntrees': 1,\n", " 'args[\"num_trees\"]': 1,\n", " 'args[0]': 1,\n", " \"best['n_estimators']\": 2,\n", " 'best_n': 1,\n", " 'best_param_rf.get(\"n_estimators\")': 2,\n", " \"best_pars['n_estimators']\": 1,\n", " 'config[\"rf:n_estimators\"]': 1,\n", " 'context[\"classifiers\"][classifier_name][\"learning_algorithm\"][\"parameters\"][\"n_estimators\"]': 1,\n", " 'e': 1,\n", " 'est': 1,\n", " 'estimator': 4,\n", " 'estimator_param': 1,\n", " 'estimators': 5,\n", " 'feature': 1,\n", " 'i': 3,\n", " 'idx + 1': 2,\n", " 'individual[3]': 1,\n", " 'inner_estimators': 1,\n", " 'int(SILLY_NUMBER*1.5)': 1,\n", " 'int(len(MetricEntry.metrics)/3)': 
1,\n", " 'int(numbtrees_param)': 1,\n", " 'int(settings[\"n_estimators\"])': 1,\n", " 'int(tunings[0])': 8,\n", " 'lNbEstimatorsInEnsembles': 2,\n", " 'max_random_trees': 2,\n", " 'min_log_loss_iter': 1,\n", " \"model_param['n_estimators']\": 1,\n", " 'mp.random_forest_estimators': 1,\n", " 'n': 15,\n", " 'n_cpu*trees_per_compute': 2,\n", " 'n_est': 11,\n", " 'n_estim': 5,\n", " 'n_estimator': 1,\n", " 'n_estimators': 73,\n", " 'n_estimators[0]': 1,\n", " 'n_estimators_options': 1,\n", " 'n_estimators_size': 2,\n", " 'n_out': 7,\n", " 'n_tree': 5,\n", " 'n_trees': 15,\n", " 'ne': 1,\n", " 'nest': 2,\n", " 'nr_of_trees': 1,\n", " 'nr_trees': 1,\n", " 'ntrees': 11,\n", " 'num': 1,\n", " 'numE': 1,\n", " 'numTrees': 2,\n", " 'num_estimators': 1,\n", " 'num_trees': 7,\n", " 'opts.estimators': 3,\n", " 'opts.numtrees': 4,\n", " 'p3': 1,\n", " 'param_n_estimators': 1,\n", " \"params['n_estimators']\": 1,\n", " \"paras['rf'][1]\": 3,\n", " 'rf_max_num_trees': 3,\n", " 'rf_n_estimators': 10,\n", " 'self.Nestimators': 1,\n", " 'self.__n_TreesInForest': 1,\n", " 'self.__n_estimators': 2,\n", " \"self._settings.get('trees', 10)\": 1,\n", " 'self.config.hid_layer_units': 1,\n", " 'self.config.hid_layer_units_baseline': 1,\n", " 'self.n_estimators': 16,\n", " 'self.n_estimators_forest': 1,\n", " 'self.n_trees': 1,\n", " 'self.numTrees': 5,\n", " \"self.params['num_estimators']\": 1,\n", " 'self.randomForestEstimators': 1,\n", " \"space['n']\": 1,\n", " \"space['n_estimators']\": 2,\n", " 'sqrt_feat_num': 1,\n", " 'trees': 9,\n", " 'val': 1}),\n", " 'n_jobs': defaultdict(int,\n", " {' -1': 31,\n", " ' pm.n_jobs': 1,\n", " ' self.n_jobs': 1,\n", " '-1': 391,\n", " '-2': 1,\n", " '1': 55,\n", " '10': 11,\n", " '12': 8,\n", " '15': 1,\n", " '16': 4,\n", " '2': 66,\n", " '3': 8,\n", " '4': 45,\n", " '40': 1,\n", " '5': 20,\n", " '6': 3,\n", " '7': 4,\n", " '8': 14,\n", " 'NUM_THREADS': 2,\n", " 'PROCESSORS': 1,\n", " 'args.cpu': 1,\n", " 'args.njobs': 1,\n", " 'cores': 1,\n", " 'cpu_counts': 1,\n", " 'cpus': 2,\n", " 'int(settings[\"n_jobs\"])': 1,\n", " 'jobs': 5,\n", " 'n_cores': 1,\n", " 'n_cpu': 2,\n", " 'n_estimators': 1,\n", " 'n_jobs': 29,\n", " 'njobs': 3,\n", " 'numJobs': 1,\n", " 'num_jobs': 2,\n", " 'number_of_threads': 1,\n", " 'options.n_jobs': 1,\n", " 'options.pyxit_n_jobs': 1,\n", " 'opts.nprocessors': 1,\n", " 'opts.numproc': 1,\n", " 'param_n_jobs': 2,\n", " 'self.n_jobs': 10,\n", " 'self.n_jobs_forest': 1,\n", " 'self.nthreads': 1,\n", " 'self.parallel_jobs': 1,\n", " \"self.params['num_jobs']\": 1,\n", " 'self.threadCount': 1,\n", " 'workers': 3}),\n", " 'numTrees': defaultdict(int, {'60': 1}),\n", " 'oob_score': defaultdict(int,\n", " {'1': 4,\n", " 'False': 34,\n", " 'True': 106,\n", " 'oob_score': 1,\n", " 'os': 1,\n", " 'param_oob_score': 2,\n", " 'self.oob_score_forest': 1}),\n", " 'random_state': defaultdict(int,\n", " {' self.ran_stat': 2,\n", " '0': 217,\n", " '1': 116,\n", " '10': 2,\n", " '1000 + l': 2,\n", " '1000+l': 2,\n", " '1104': 1,\n", " '123': 13,\n", " '1234': 1,\n", " '12345': 4,\n", " '125': 1,\n", " '13': 4,\n", " '1301': 1,\n", " '131': 4,\n", " '1337': 1,\n", " '142': 1,\n", " '144': 2,\n", " '150': 1,\n", " '17': 1,\n", " '192': 3,\n", " '1960': 3,\n", " '2': 4,\n", " '20': 3,\n", " '2016': 1,\n", " '21': 1,\n", " '234': 1,\n", " '241': 1,\n", " '2543': 2,\n", " '30': 4,\n", " '32': 1,\n", " '321': 1,\n", " '324089': 2,\n", " '33': 2,\n", " '4': 8,\n", " '4141': 1,\n", " '42': 61,\n", " '451': 1,\n", " '5': 2,\n", " '50': 8,\n", " '571': 1,\n", " '600': 1,\n", " 
'7': 4,\n", " '7112016': 8,\n", " '77': 2,\n", " '782629': 1,\n", " '84': 3,\n", " '87': 1,\n", " '88': 1,\n", " '93758': 1,\n", " 'None': 31,\n", " 'RANDOM_STATE': 7,\n", " 'RDM': 1,\n", " 'RND_SEED': 3,\n", " 'RandomState(__seed__)': 1,\n", " 'RandomState(seed)': 1,\n", " 'SEED': 1,\n", " 'args[\"seed\"]': 1,\n", " 'choosen_random_state': 2,\n", " 'generator': 2,\n", " 'i': 6,\n", " 'n': 1,\n", " 'np.random.RandomState(0)': 1,\n", " 'param_random_state': 2,\n", " 'prng': 1,\n", " 'rand': 1,\n", " 'rand_state': 1,\n", " 'random': 9,\n", " 'random_seed': 1,\n", " 'random_state': 37,\n", " 'randomseedcounter': 4,\n", " 'rng': 5,\n", " 'seed': 7,\n", " 'self.random_state': 12,\n", " 'self.random_state_forest': 1,\n", " 'self.rng': 1,\n", " 'self.rs': 2,\n", " 'self.seed': 2,\n", " 'settings[\"random_state\"]': 1}),\n", " 'seed': defaultdict(int, {'1111': 1}),\n", " 'verbose': defaultdict(int,\n", " {'(\\n 2 if debug is True else 0)': 1,\n", " '(2 if debug is True else 0)': 1,\n", " '(args.loglevel == logging.DEBUG)': 1,\n", " '0': 37,\n", " '1': 29,\n", " '10': 3,\n", " '2': 30,\n", " '20': 15,\n", " '3': 6,\n", " '42': 1,\n", " 'False': 3,\n", " 'True': 8,\n", " 'VERBOSE': 2,\n", " 'int(settings[\"verbose\"])': 1,\n", " 'options.verbose': 1,\n", " 'param_verbose': 2,\n", " 'self.verbose_forest': 1,\n", " 'verbose': 19}),\n", " 'warm_start': defaultdict(int,\n", " {'False': 22,\n", " 'True': 20,\n", " 'param_warm_start': 2,\n", " 'self.warm_start_forest': 1,\n", " 'ws': 1})})" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rfc_instantiation" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }