Odyssey Tutorial

Welcome to Odyssey! Odyssey is a Python package that analyzes Python library usage on GitHub through Google BigQuery. The purpose of this tutorial is to give you a high-level idea of how to use it. Let’s begin!

Part 1: Work with the GithubPython object

We start by introducing the central piece of Odyssey: the GithubPython object. It connects to GitHub data through BigQuery and takes care of the BigQuery connection, SQL query building, result polling, and so on for you.

Let’s start by creating a default GithubPython object. Because we don’t specify any package, the information we get covers all the data in the BigQuery GitHub database.

In [1]:
from odyssey.core.bigquery.GithubPython import GithubPython
gp = GithubPython()

Let’s see how many Python files are in the BigQuery GitHub database.

You may look at the count below and think: “Wow, that’s far fewer than I expected. Does it mean there are only ~6 million Python files on GitHub?” The answer is no. The main reason is that Google BigQuery only has access to open-source repos on GitHub (those with certain licenses). It is therefore just a small subset of GitHub as a whole.

That’s why the number you get when searching for *.py files in the GitHub web GUI won’t be comparable to the number you get here.

In [2]:
print(gp.get_count())
5995653

Now let’s create another GithubPython object, but this time, specify that the package we are interested in is sklearn.

Odyssey also lets you exclude forks of the package by explicitly providing a list of keywords that must not appear in the repo name or file path. In this case, scikit-learn is the keyword to exclude.

In [3]:
gp_sklearn = GithubPython(package="sklearn", exclude_forks=["scikit-learn"])
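
To make the keyword-based exclusion concrete, here is a minimal sketch of such a filter in plain Python. It is not Odyssey's actual implementation (Odyssey builds a BigQuery SQL query for you); the helper name is_excluded and the sample entries are hypothetical and only illustrate the idea of dropping any file whose repo name or path contains an excluded keyword.

```python
def is_excluded(repo_name, path, keywords):
    """Return True if any keyword occurs in the repo name or the file path."""
    return any(kw in repo_name or kw in path for kw in keywords)


# Hypothetical (repo_name, path) pairs, made up for illustration.
entries = [
    ("someuser/scikit-learn", "sklearn/tree/tree.py"),                # a fork
    ("someuser/my-project", "vendor/scikit-learn/sklearn/base.py"),   # vendored copy
    ("someuser/my-project", "src/train.py"),                          # ordinary user code
]

kept = [e for e in entries if not is_excluded(e[0], e[1], ["scikit-learn"])]
print(kept)  # [('someuser/my-project', 'src/train.py')]
```

Note that a substring filter like this also drops vendored copies and any unrelated repo whose name happens to contain the keyword, which is usually the desired trade-off when counting library usage.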

Now let’s count how many files contain “sklearn”. Caveat: this is simple string matching, so a file still counts even if sklearn appears only in a comment or as a variable name!

In [4]:
print(gp_sklearn.get_count())
37262
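
The caveat about string matching can be illustrated with a small standalone sketch that contrasts a substring test with an actual import check based on the standard ast module. The helper names contains_word and imports_package are hypothetical and not part of Odyssey. The ast-based check only works on sources that parse under the running Python version, so Python 2 files would raise a SyntaxError here.

```python
import ast


def contains_word(source, word):
    """Simple string matching: True if the word appears anywhere in the text."""
    return word in source


def imports_package(source, package):
    """True only if the source has an absolute import of the package."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == package for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import (no leading dots).
            if node.level == 0 and node.module and node.module.split(".")[0] == package:
                return True
    return False


comment_only = "# TODO: maybe try sklearn here\nx = 1\n"
real_import = "from sklearn.svm import SVC\n"

print(contains_word(comment_only, "sklearn"))    # True
print(imports_package(comment_only, "sklearn"))  # False
print(imports_package(real_import, "sklearn"))   # True
```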

If you want to see exactly which 37262 files contain the word “sklearn”, you can use get_all() to retrieve all the entries. The result is a list of BigQueryGithubEntry objects, a wrapper that provides handy utility functions such as get_url().

In [5]:
data = gp_sklearn.get_all()
In [6]:
print(type(data[0]))
<class 'odyssey.core.bigquery.BigQueryGithubEntry.BigQueryGithubEntry'>
In [7]:
from odyssey.utils.output import pprint_ipynb
In [8]:
pprint_ipynb(data[0])
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Chainer example: Convolutional Neural Networks with SPP for Sentence Classification

http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf
https://arxiv.org/pdf/1406.4729v4.pdf

"""

__version__ = '0.0.1'

import sys

reload(sys)
sys.setdefaultencoding('utf-8')
#print sys.getdefaultencoding()

import re
import logging
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

import pprint
def pp(obj):
    pp = pprint.PrettyPrinter(indent=1, width=160)
    str = pp.pformat(obj)
    print re.sub(r"\\u([0-9a-f]{4})", lambda x: unichr(int("0x"+x.group(1),16)), str)

import os, time, six
start_time = time.time()

import struct
import numpy as np
import cPickle as pickle
import matplotlib.pyplot as plt
import copy

from chainer import cuda, Chain, Variable, optimizers, serializers, computational_graph
import chainer.functions as F
import chainer.links as L
import chainer.optimizer

xp = np
BOS_TOKEN = '<s>'
EOS_TOKEN = '</s>'
UNK_TOKEN = '<unk>'
PAD_TOKEN = '<pad>'


def load_w2v_model(path):

    # with open(path, 'rb') as f:
    #     w2i = {}
    #     i2w = {}
    #
    #     n_vocab, n_units = map(int, f.readline().split())
    #     w = np.empty((n_vocab, n_units), dtype=np.float32)
    #
    #     for i in xrange(n_vocab):
    #         word = ''
    #         while True:
    #             ch = f.read(1)
    #             if ch == ' ': break
    #             word += ch
    #
    #         try:
    #             w2i[unicode(word)] = i
    #             i2w[i] = unicode(word)
    #
    #         except RuntimeError:
    #             logging.error('Error unicode(): %s', word)
    #             w2i[word] = i
    #             i2w[i] = word
    #
    #         w[i] = np.zeros(n_units)
    #         for j in xrange(n_units):
    #             w[i][j] = struct.unpack('f', f.read(struct.calcsize('f')))[0]
    #
    #         # ベクトルを正規化する
    #         vlen = np.linalg.norm(w[i], 2)
    #         w[i] /= vlen
    #
    #         # 改行を strip する
    #         assert f.read(1) == '\n'
    # return w, w2i, i2w

    from gensim.models import word2vec
    return word2vec.Word2Vec.load_word2vec_format(path, binary=True)


def load_data(path, w2v):
    X_data, Y = [], []
    labels = {}

    X = []
    max_len = 0

    f = open(path, 'rU')
    for i, line in enumerate(f):
        # if i >= 10:
        #     break

        line = unicode(line).strip()
        if line == u'':
            continue

        cols = line.split(u'\t')
        if len(cols) < 2:
            sys.stderr.write('invalid record: {}\n'.format(line))
            continue

        label = cols[0]
        text  = cols[1]

        tokens = text.split(' ')

        vec = []
        for token in tokens:
            try:
                vec.append(w2v[token])
            except KeyError:
                sys.stderr.write('unk: {}\n'.format(token))
                vec.append(w2v.seeded_vector(UNK_TOKEN))

        if len(vec) > max_len:
            max_len = len(vec)
        X.append(vec)

        if label not in labels:
            labels[label] = len(labels)
        Y.append(labels[label])

    f.close()

    for vec in X:
        pad = [w2v.seeded_vector(PAD_TOKEN) for _ in range(max_len - len(vec))]
        vec.extend(pad)

    return X, Y, labels


class MySPP(Chain):
    def __init__(self, input_channel, output_channel, width, n_units, n_label):
        super(MySPP, self).__init__(
            conv1=L.Convolution2D(input_channel, output_channel, (3, width), pad=0),
            conv2=L.Convolution2D(input_channel, output_channel, (4, width), pad=0),
            conv3=L.Convolution2D(input_channel, output_channel, (5, width), pad=0),
            fc4=L.Linear(output_channel * 3 * 3, n_units),
            fc5=L.Linear(n_units, n_label)
        )

    def __call__(self, x, t, train=True):
        y = self.forward(x, train=train)
        return F.softmax_cross_entropy(y, t), F.accuracy(y, t)

    def forward(self, x, train=True):
        h1 = F.spatial_pyramid_pooling_2d(F.relu(self.conv1(x)), 2, F.MaxPooling2D)
        h2 = F.spatial_pyramid_pooling_2d(F.relu(self.conv2(x)), 2, F.MaxPooling2D)
        h3 = F.spatial_pyramid_pooling_2d(F.relu(self.conv3(x)), 2, F.MaxPooling2D)

        # Convolution + Pooling を行った結果を結合する
        concat = F.concat((h1, h2, h3), axis=1)

        # 結合した結果に Dropout をかける
        h4 = F.dropout(F.tanh(self.fc4(concat)), ratio=0.5, train=train)

        # Dropout の結果を結合する
        y = self.fc5(h4)

        return y


if __name__ == '__main__':

    from argparse import ArgumentParser
    parser = ArgumentParser(description='Chainer example: MySPP')
    parser.add_argument('--train',           default='',  type=unicode, help='training file (.txt)')
    parser.add_argument('--test',            default='',  type=unicode, help='evaluating file (.txt)')
    parser.add_argument('--w2v',       '-w', default='',  type=unicode, help='word2vec model file (.bin)')
    parser.add_argument('--gpu',       '-g', default=-1,  type=int, help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--epoch',     '-e', default=25,  type=int, help='number of epochs to learn')
    parser.add_argument('--unit',      '-u', default=300, type=int, help='number of output channels')
    parser.add_argument('--batchsize', '-b', default=100, type=int, help='learning batchsize size')
    parser.add_argument('--output',    '-o', default='model-spp3-w2v',  type=str, help='output directory')
    args = parser.parse_args()

    if args.gpu >= 0:
        cuda.check_cuda_available()
        cuda.get_device(args.gpu).use()

    xp = cuda.cupy if args.gpu >= 0 else np
    # xp.random.seed(123)

    # 学習の繰り返し回数
    n_epoch = args.epoch

    # 中間層の数
    n_units = args.unit

    # 確率的勾配降下法で学習させる際の1回分のバッチサイズ
    batchsize = args.batchsize

    model_dir = args.output
    if not os.path.exists(model_dir):
        os.mkdir(model_dir)

    print('# loading word2vec model: {}'.format(args.w2v))
    sys.stdout.flush()
    model = load_w2v_model(args.w2v)
    n_vocab = len(model.vocab)

    # データの読み込み
    X, y, labels = load_data(args.train, w2v=model)
    X = xp.asarray(X, dtype=np.float32)
    y = xp.asarray(y, dtype=np.int32)

    n_sample = X.shape[0]
    height   = X.shape[1]
    width    = X.shape[2]
    n_label = len(labels)

    input_channel = 1
    output_channel = 50

    # (nsample, channel, height, width) の4次元テンソルに変換
    X = X.reshape((n_sample, input_channel, height, width))

    # トレーニングデータとテストデータに分割
    from sklearn.model_selection import train_test_split
    # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=123)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
    N = len(X_train)
    N_test = len(X_test)

    print('# gpu: {}'.format(args.gpu))
    print('# embedding dim: {}, vocab {}'.format(width, n_vocab))
    print('# epoch: {}'.format(n_epoch))
    print('# batchsize: {}'.format(batchsize))
    print('# input channel: {}'.format(1))
    print('# output channel: {}'.format(n_units))
    print('# train: {}, test: {}'.format(N, N_test))
    print('# data height: {}, width: {}, labels: {}'.format(height, width, n_label))
    sys.stdout.flush()

    # Prepare CNN
    model = MySPP(input_channel, output_channel, width, n_units, n_label)

    if args.gpu >= 0:
        model.to_gpu()

    # 重み減衰
    decay = 0.0001

    # 勾配上限
    grad_clip = 3

    # Setup optimizer (Optimizer の設定)
    # optimizer = optimizers.Adam()
    optimizer = optimizers.AdaDelta()
    optimizer.setup(model)
    optimizer.add_hook(chainer.optimizer.GradientClipping(grad_clip))
    optimizer.add_hook(chainer.optimizer.WeightDecay(decay))

    # プロット用に実行結果を保存する
    train_loss = []
    train_norm = []
    train_accuracy = []
    test_loss = []
    test_accuracy = []

    start_at = time.time()
    cur_at = start_at

    # Learning loop
    for epoch in six.moves.range(1, n_epoch + 1):

        print('epoch {:} / {:}'.format(epoch, n_epoch))
        sys.stdout.flush()

        # sorted_gen = batch(sorted_parallel(X_train, y_train, N * batchsize), batchsize)
        sum_train_loss = 0.
        sum_train_accuracy = 0.
        K = 0

        # training
        # N 個の順番をランダムに並び替える
        perm = np.random.permutation(N)
        for i in six.moves.range(0, N, batchsize):

            x = Variable(X_train[perm[i:i + batchsize]], volatile='off')
            t = Variable(y_train[perm[i:i + batchsize]], volatile='off')

            # 勾配を初期化
            model.cleargrads()

            # 順伝播させて誤差と精度を算出
            loss, accuracy = model(x, t, train=True)

            sum_train_loss += float(loss.data) * len(t)
            sum_train_accuracy += float(accuracy.data) * len(t)
            K += len(t)

            # 誤差逆伝播で勾配を計算
            loss.backward()
            optimizer.update()

        train_loss.append(sum_train_loss / K)
        train_accuracy.append(sum_train_accuracy / K)

        # 訓練データの誤差と,正解精度を表示
        now = time.time()
        throuput = now - cur_at
        norm = optimizer.compute_grads_norm()
        print('train mean loss={:.6f}, accuracy={:.6f} ({:.6f} sec)'.format(sum_train_loss / K, sum_train_accuracy / K, throuput))
        sys.stdout.flush()
        cur_at = now

        # evaluation
        sum_test_loss = 0.
        sum_test_accuracy = 0.
        K = 0
        for i in six.moves.range(0, N_test, batchsize):

            x = Variable(X_test[i:i + batchsize], volatile='on')
            t = Variable(y_test[i:i + batchsize], volatile='on')

            # 順伝播させて誤差と精度を算出
            loss, accuracy = model(x, t, train=False)

            sum_test_loss += float(loss.data) * len(t)
            sum_test_accuracy += float(accuracy.data) * len(t)
            K += len(t)

        test_loss.append(sum_test_loss / K)
        test_accuracy.append(sum_test_accuracy / K)

        # テストデータでの誤差と正解精度を表示
        now = time.time()
        throuput = now - cur_at
        print(' test mean loss={:.6f}, accuracy={:.6f} ({:.6f} sec)'.format(sum_test_loss / K, sum_test_accuracy / K, throuput))
        sys.stdout.flush()
        cur_at = now

        # model と optimizer を保存する
        if args.gpu >= 0: model.to_cpu()
        with open(os.path.join(model_dir, 'epoch_{:03d}.model'.format(epoch)), 'wb') as f:
            pickle.dump(model, f)
        if args.gpu >= 0: model.to_gpu()
        with open(os.path.join(model_dir, 'epoch_{:03d}.state'.format(epoch)), 'wb') as f:
            pickle.dump(optimizer, f)

        # 精度と誤差をグラフ描画
        if True:
            ylim1 = [min(train_loss + test_loss), max(train_loss + test_loss)]
            ylim2 = [min(train_accuracy + test_accuracy), max(train_accuracy + test_accuracy)]

            # グラフ左
            plt.figure(figsize=(10, 10))
            plt.subplot(1, 2, 1)
            plt.ylim(ylim1)
            plt.plot(range(1, len(train_loss) + 1), train_loss, 'b')
            plt.grid()
            plt.ylabel('loss')
            plt.legend(['train loss', 'train l2-norm'], loc="lower left")
            plt.twinx()
            plt.ylim(ylim2)
            plt.plot(range(1, len(train_accuracy) + 1), train_accuracy, 'm')
            plt.grid()
            # plt.ylabel('accuracy')
            plt.legend(['train accuracy'], loc="upper left")
            plt.title('Loss and accuracy of training.')

            # グラフ右
            plt.subplot(1, 2, 2)
            plt.ylim(ylim1)
            plt.plot(range(1, len(test_loss) + 1), test_loss, 'b')
            plt.grid()
            # plt.ylabel('loss')
            plt.legend(['test loss'], loc="lower left")
            plt.twinx()
            plt.ylim(ylim2)
            plt.plot(range(1, len(test_accuracy) + 1), test_accuracy, 'm')
            plt.grid()
            plt.ylabel('accuracy')
            plt.legend(['test accuracy'], loc="upper left")
            plt.title('Loss and accuracy of test.')

            plt.savefig('{}.png'.format(model_dir))
            # plt.show()

        cur_at = now

    # model と optimizer を保存する
    if args.gpu >= 0: model.to_cpu()
    with open(os.path.join(model_dir, 'final.model'), 'wb') as f:
        pickle.dump(model, f)
    if args.gpu >= 0: model.to_gpu()
    with open(os.path.join(model_dir, 'final.state'), 'wb') as f:
        pickle.dump(optimizer, f)

print('time spent:', time.time() - start_time)
In [9]:
data[0].get_url() # a link to Github file
Out[9]:
'https://github.com/haradatm/nlp/tree/master/classify/train_spp3-w2v.py'

Part 3: Repos with top imports

One common question Python library writers (or even users) are interested in is: who is using this library? Odyssey supports querying the repos with the most imports of your package of interest. In one line, you can get the answer!

Note: The first time running this will be very slow!

In [17]:
top20_imports = gp_sklearn.get_top_import_repo(n=20) # top imports by file count
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
In [18]:
print(top20_imports)
[('ngoix/OCRF', 291), ('automl/auto-sklearn', 195), ('hmendozap/auto-sklearn', 186), ('florian-f/sklearn', 146), ('seckcoder/lang-learn', 141), ('GbalsaC/bitnamiP', 119), ('automl/paramsklearn', 100), ('chaluemwut/fbserver', 99), ('magic2du/contact_matrix', 96), ('nok/sklearn-porter', 95), ('jpzk/evopy', 87), ('B3AU/waveTree', 77), ('sinhrks/expandas', 64), ('chkoar/imbalanced-learn', 61), ('liyu1990/sklearn', 61), ('KennyCandy/HAR', 54), ('sinhrks/pandas-ml', 54), ('RecipeML/Recipe', 52), ('dvro/imbalanced-learn', 51), ('Tjorriemorrie/trading', 51)]
In [22]:
# Verify that the count matches
print(len(top20_imports)) # 20
20
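
Ranking repos "by file count" can be sketched with collections.Counter. This assumes, as a simplification, that we already have one repo name per matching file; the helper top_repos and the sample data are hypothetical, and Odyssey's own implementation may differ.

```python
from collections import Counter


def top_repos(repo_names, n):
    """Return the n repos with the most matching files as (repo, count) pairs."""
    return Counter(repo_names).most_common(n)


# Hypothetical input: one repo name per file that matched the package.
files = ["a/x", "b/y", "a/x", "c/z", "a/x", "b/y"]
print(top_repos(files, 2))  # [('a/x', 3), ('b/y', 2)]
```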

Part 4: Most imported class/submodule/function

Another common question is how often a certain class/submodule/function is imported. Odyssey can answer that too.

In [23]:
top20_models = gp_sklearn.get_most_imported_class(n=20)
In [24]:
print(top20_models)
[('RandomForestClassifier', 2534), ('LogisticRegression', 2152), ('SVC', 1998), ('StandardScaler', 1783), ('PCA', 1732), ('Pipeline', 1519), ('GridSearchCV', 1511), ('KMeans', 1451), ('TfidfVectorizer', 1314), ('CountVectorizer', 1294), ('KNeighborsClassifier', 1188), ('LinearSVC', 1116), ('DecisionTreeClassifier', 1047), ('LinearRegression', 861), ('GaussianNB', 817), ('LabelEncoder', 728), ('MultinomialNB', 723), ('RandomForestRegressor', 681), ('AdaBoostClassifier', 673), ('SGDClassifier', 642)]
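
As a rough illustration of how such a ranking could be computed, the sketch below collects the names brought in by `from <package>... import ...` statements with the standard ast module and counts each name once per file. Both the helper imported_names and the once-per-file counting rule are assumptions made for this example; the tutorial does not state how Odyssey counts.

```python
import ast
from collections import Counter


def imported_names(source, package):
    """Names imported via absolute `from <package>[.sub] import ...` statements."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.ImportFrom) and node.level == 0
                and node.module and node.module.split(".")[0] == package):
            names.extend(alias.name for alias in node.names)
    return names


# Hypothetical file contents, made up for illustration.
sources = [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
    "from sklearn.svm import SVC\nimport numpy as np\n",
]

counts = Counter()
for src in sources:
    # set() so that a name imported twice in one file is counted once.
    counts.update(set(imported_names(src, "sklearn")))

print(counts.most_common(1))  # [('RandomForestClassifier', 2)]
```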

To see the entries behind a count, call get_import_source().

In [25]:
sources = gp_sklearn.get_import_source("RandomForestClassifier")
In [26]:
pprint_ipynb(sources[0])
# coding: utf-8

# ### Open using Jupyter Notebook. It holds the code and visualizations for developing the different classification algorithms (LibSVM, RBF SVM, Naive Bayes, Random Forest, Gradient Boosting) on the chosen subset of important features. 

# In[27]:

import pandas as pd
import numpy as np
from numpy import sort
from sklearn.metrics import matthews_corrcoef, accuracy_score,confusion_matrix
from sklearn.feature_selection import SelectFromModel
from matplotlib import pyplot
import pylab as pl
from sklearn import svm

get_ipython().magic(u'matplotlib inline')


# In[4]:

SEED = 1234
## Selected set of most important features

featureSet=['L3_S31_F3846','L1_S24_F1578','L3_S33_F3857','L1_S24_F1406','L3_S29_F3348','L3_S33_F3863',
            'L3_S29_F3427','L3_S37_F3950','L0_S9_F170', 'L3_S29_F3321','L1_S24_F1346','L3_S32_F3850',
            'L3_S30_F3514','L1_S24_F1366','L2_S26_F3036']

train_x = pd.read_csv("../data/train_numeric.csv", usecols=featureSet)
train_y = pd.read_csv("../data/train_numeric.csv", usecols=['Response'])


# In[5]:

test_x = pd.read_csv("../data/test_numeric.csv", usecols=featureSet)


# In[6]:

train_x = train_x.fillna(9999999)
msk = np.random.rand(len(train_x)) < 0.7  # creating Training and validation set 


X_train = train_x[msk]

Y_train = train_y.Response.ravel()[msk]

X_valid = train_x[~msk]
Y_valid = train_y.Response.ravel()[~msk]


# In[7]:

def showconfusionmatrix(cm, typeModel):
    pl.matshow(cm)
    pl.title('Confusion matrix for '+typeModel)
    pl.colorbar()
    pl.show()


# In[24]:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

C=4
lin_svc = svm.LinearSVC(C=C).fit(X_train, Y_train)
print "LibSVM fitted"

title = 'LinearSVC (linear kernel)'

predicted = lin_svc.predict(X_valid)
mcc= matthews_corrcoef(Y_valid, predicted)
print "MCC Score \t +"+title+str(mcc)

cm = confusion_matrix(predicted, Y_valid)
showconfusionmatrix(cm, title)
print "Confusion Matrix"
print (cm)


# In[22]:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

C=4
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_train, Y_train)
print "RBF fitted"


title = 'SVC with RBF kernel'

predicted = rbf_svc.predict(X_valid)
mcc= matthews_corrcoef(Y_valid, predicted)
print "MCC Score \t +"+title+str(mcc)

cm = confusion_matrix(predicted, Y_valid)
showconfusionmatrix(cm, title)
print "Confusion Matrix"
print (cm)


# In[10]:


from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

clf = gnb.fit(X_train,Y_train)
print "Naive Bayes Fitted"


title = 'Naive Bayes'

predicted = clf.predict(X_valid)


mcc= matthews_corrcoef(Y_valid, predicted)
print "MCC Score \t +"+title+str(mcc)

cm = confusion_matrix(predicted, Y_valid)
showconfusionmatrix(cm, title)
print "Confusion Matrix"
print (cm)


# In[21]:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import GridSearchCV


# In[23]:

rf = RandomForestClassifier(n_estimators=20, n_jobs=2)
param_grid = {
                 'n_estimators': [5, 10, 15, 20],
                 'max_depth': [2, 5, 7, 9]
             }


# In[24]:

grid_rf = GridSearchCV(rf, param_grid, cv=10)
rf_model=grid_rf.fit(X_train, Y_train)


# In[30]:

print "RF fitted"

titles = 'Random Forest'

predicted = rf_model.predict(X_valid)
mcc= matthews_corrcoef(Y_valid, predicted)
print "MCC Score \t +"+titles[0]+str(mcc)

cm = confusion_matrix(predicted, Y_valid)
showconfusionmatrix(cm, titles[0])


# In[31]:

gb = GradientBoostingClassifier(learning_rate=0.5)
param_grid = {
                 'n_estimators': [5, 10, 15, 20],
                 'max_depth': [2, 5, 7, 9]
             }


# In[32]:

grid_gb = GridSearchCV(gb, param_grid, cv=10)
gb_model=grid_gb.fit(X_train, Y_train)


# In[36]:

print "GB fitted"

title = 'Gradient Boosting'

predicted = gb_model.predict(X_valid)
mcc= matthews_corrcoef(Y_valid, predicted)
print "MCC Score \t +"+title+str(mcc)

cm = confusion_matrix(predicted, Y_valid)
showconfusionmatrix(cm, title)

Part 5: Instantiation

For classes, Odyssey can provide insights into how they are instantiated, which argument values people pass, and so on.

Note: All the argument values in the returned dictionary are strings (even integer values). This may change later.
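
Because every value is stored as source text, the same literal can show up under several keys (for example '"gini"' and "'gini'"). The sketch below shows one possible way to post-process a single argument's counts by parsing literals with ast.literal_eval. The helpers normalize and merge_counts are hypothetical and not part of Odyssey; the sample counts are taken from the criterion entry of the output shown below.

```python
import ast
from collections import Counter


def normalize(value):
    """Map a source-text value to ('literal', parsed) or ('expr', text)."""
    text = value.strip()
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return ("expr", text)  # a variable name or other expression
    if isinstance(parsed, (str, int, float, bool, type(None))):
        return ("literal", parsed)
    return ("expr", text)  # e.g. dicts are unhashable, so keep their text


def merge_counts(raw_counts):
    """Merge counts whose keys normalize to the same value."""
    merged = Counter()
    for value, count in raw_counts.items():
        merged[normalize(value)] += count
    return merged


criterion = {'"entropy"': 37, '"gini"': 9, "'entropy'": 66, "'gini'": 81, 'criterion': 13}
merged = merge_counts(criterion)
print(merged[("literal", "gini")])     # 90
print(merged[("literal", "entropy")])  # 103
print(merged[("expr", "criterion")])   # 13
```

Tagging each key as a literal or an expression keeps the string literal 'gini' distinct from a variable that happens to be named gini.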

In [27]:
rfc_instantiation = gp_sklearn.get_instantiation("RandomForestClassifier")
In [28]:
rfc_instantiation
Out[28]:
defaultdict(<function odyssey.core.analyzer.InstantiationAnalyzer.InstantiationAnalyzer.__init__.<locals>.<lambda>>,
            {'*': defaultdict(int, {None: 4}),
             '**': defaultdict(int, {None: 24}),
             '**args': defaultdict(int, {None: 2}),
             '**classi_params': defaultdict(int, {None: 1}),
             '**classif_base.get_params()': defaultdict(int, {None: 1}),
             '**classifier_pram_dic[rf_name]': defaultdict(int, {None: 1}),
             "**clf.get('config')": defaultdict(int, {None: 2}),
             '**clf_args_': defaultdict(int, {None: 2}),
             '**clf_params': defaultdict(int, {None: 1}),
             '**cls_kwargs': defaultdict(int, {None: 1}),
             '**config_clf': defaultdict(int, {None: 1}),
             '**estimator_params': defaultdict(int, {None: 4}),
             '**forest_parms': defaultdict(int, {None: 2}),
             '**gs.best_params_': defaultdict(int, {None: 1}),
             '**job': defaultdict(int, {None: 1}),
             '**kwargs': defaultdict(int, {None: 10}),
             '**model_params': defaultdict(int, {None: 3}),
             '**params': defaultdict(int, {None: 13}),
             '**params_used': defaultdict(int, {None: 1}),
             '**parse.config_for_function(RandomForestClassifier.__init__, config)': defaultdict(int,
                         {None: 2}),
             '**rf_config': defaultdict(int, {None: 1}),
             '**rf_parameters': defaultdict(int, {None: 3}),
             '**rf_params': defaultdict(int, {None: 1}),
             '**self.param_dict': defaultdict(int, {None: 1}),
             '**self.params': defaultdict(int, {None: 4}),
             "**{'n_estimators' : 7500, 'max_depth' : 200}": defaultdict(int,
                         {None: 2}),
             'bootstrap': defaultdict(int,
                         {'False': 26,
                          'True': 62,
                          'bootstrap': 2,
                          'bs': 1,
                          'bstp': 1,
                          'config["rf:bootstrap"]': 1,
                          'context["classifiers"][classifier_name]["learning_algorithm"]["parameters"]["bootstrap"]': 1,
                          'estimator.best_estimator_.bootstrap': 2,
                          'p2': 1,
                          'param_bootstrap': 2,
                          'self.bootstrap': 4,
                          'self.bootstrap_forest': 1,
                          'settings["bootstrap"]': 1}),
             'class_weight': defaultdict(int,
                         {' {0: 1, 1:10}': 3,
                          ' {0:0.098, \n                                                                                    1:0.111, \n                                                                                    2:0.104, \n                                                                                    3:0.102, \n                                                                                    4:0.098, \n                                                                                    5:0.088, \n                                                                                    6:0.095, \n                                                                                    7:0.103, \n                                                                                    8:0.098, \n                                                                                    9:0.102}': 2,
                          '"auto"': 3,
                          '"balanced"': 35,
                          '"balanced_subsample"': 1,
                          "'auto'": 16,
                          "'balanced'": 26,
                          "'balanced_subsample'": 5,
                          "'subsample'": 1,
                          'None': 21,
                          'class_weight': 5,
                          'class_wt': 1,
                          'cw': 1,
                          'param_class_weight': 2,
                          "params['class_weight']": 1,
                          'rf_weights': 1,
                          'self.class_weight': 5,
                          'self.class_weight_forest': 1,
                          'weight': 2,
                          '{0: 1, 1: 28}': 2,
                          "{0: 1, 1: space['cw']}": 1,
                          "{0:1, 1: space['cw']}": 1,
                          '{0:100, 1:1}': 1,
                          '{1:weights*np.count_nonzero(Y)/len(Y),0:1-(np.count_nonzero(Y)/len(Y))}': 1,
                          '{False:1, True:1}': 1}),
             'compute_importances': defaultdict(int,
                         {'False': 2, 'None': 1, 'True': 25}),
             'criterion': defaultdict(int,
                         {'"entropy"': 37,
                          '"gini"': 9,
                          "'entropy'": 66,
                          "'gini'": 81,
                          'CRIT': 1,
                          'args[1]': 1,
                          'c': 5,
                          'config["rf:criterion"]': 1,
                          'context["classifiers"][classifier_name]["learning_algorithm"]["parameters"]["criterion"]': 1,
                          'crit': 1,
                          'crit_out': 6,
                          'criterion': 13,
                          "criterion[best['criterion']]": 2,
                          'criterion_t': 1,
                          'estimator.best_estimator_.criterion': 2,
                          'feature': 1,
                          'p1': 1,
                          'param_criterion': 2,
                          "params['criterion']": 1,
                          'rf_criterion': 3,
                          'self.criterion': 6,
                          'self.criterion_forest': 1,
                          'settings["criterion"]': 1,
                          'splitcriteria_param': 1}),
             'featuresCol': defaultdict(int, {'"features"': 1}),
             'labelCol': defaultdict(int, {'"Response"': 1}),
             'maxDepth': defaultdict(int, {'15': 1}),
             'max_depth': defaultdict(int,
                         {'1': 6,
                          '10': 36,
                          '100': 10,
                          '13': 5,
                          '14': 1,
                          '15': 8,
                          '16': 10,
                          '17': 3,
                          '2': 11,
                          '20': 4,
                          '2000': 1,
                          '22': 1,
                          '25': 5,
                          '3': 11,
                          '30': 2,
                          '4': 11,
                          '40': 3,
                          '5': 57,
                          '50': 9,
                          '52': 14,
                          '6': 5,
                          '60': 2,
                          '600': 1,
                          '7': 6,
                          '700': 1,
                          '8': 7,
                          '80': 2,
                          '9': 2,
                          'C': 2,
                          'None': 126,
                          'RFC_depth': 6,
                          'RF_depth': 1,
                          'TREE_DEPTH': 2,
                          '_max_depth': 1,
                          'args["max_tree_nodes"]': 1,
                          'args[2]': 1,
                          'best_m': 1,
                          "best_pars['max_depth']": 1,
                          'config["rf:max_depth"]': 1,
                          'context["classifiers"][classifier_name]["learning_algorithm"]["parameters"]["max_depth"]': 1,
                          'depth': 7,
                          'depth_out': 4,
                          'estimator.best_estimator_.max_depth': 2,
                          'feature': 1,
                          "grid_search.best_params_['max_depth']": 1,
                          'hyper_parameter': 2,
                          'length': 1,
                          'm': 1,
                          'm_d': 2,
                          'm_dep': 3,
                          'maxDepth[0]': 1,
                          'max_D': 1,
                          'max_d': 2,
                          'max_dep': 9,
                          'max_depth': 24,
                          'max_depth_option': 1,
                          'max_tree_depth': 1,
                          'md': 2,
                          'p4': 1,
                          'param_max_depth': 2,
                          "params['max_depth']": 1,
                          "paras['rf'][0]": 3,
                          'rf_max_depth': 3,
                          "self._settings.get('max_depth', 10)": 1,
                          'self.k': 1,
                          'self.max_depth': 6,
                          'self.max_depth_forest': 1,
                          "space['max_depth']": 3}),
             'max_features': defaultdict(int,
                         {' int(math.sqrt(features))': 1,
                          " params['max_features']": 1,
                          '"auto"': 17,
                          '"log2"': 7,
                          '"sqrt"': 25,
                          "'auto'": 64,
                          "'log2'": 7,
                          "'sqrt'": 17,
                          '.33': 1,
                          '0.1': 2,
                          '0.2': 1,
                          '0.4': 3,
                          '0.497907908371': 1,
                          '0.5': 2,
                          '0.59': 1,
                          '0.6': 1,
                          '0.7': 1,
                          '0.8': 1,
                          '1': 59,
                          '1.': 1,
                          '1.0/3': 1,
                          '10': 6,
                          '100': 2,
                          '128': 1,
                          '15': 1,
                          '16': 1,
                          '2': 3,
                          '20': 2,
                          '200': 1,
                          '3': 5,
                          '30': 1,
                          '375': 1,
                          '38': 1,
                          '4': 4,
                          '5': 9,
                          '50': 1,
                          '500': 2,
                          '7': 3,
                          '8': 1,
                          '80': 1,
                          'None': 43,
                          'R': 1,
                          'SILLY_NUMBER': 1,
                          "best_params[dataset_name][method_name]['rf_max_features']": 1,
                          "best_params[dataset_name][method_name][nr_events]['rf_max_features']": 2,
                          "best_pars['max_features']": 1,
                          'c_max_features': 1,
                          'config[\n                                                "rf:max_features"]': 1,
                          'context["classifiers"][classifier_name]["learning_algorithm"]["parameters"]["max_features"]': 1,
                          'feature': 4,
                          'features': 3,
                          "grid_search.best_params_['max_features']": 1,
                          'individual[2]': 1,
                          'int(math.sqrt(n_features))': 2,
                          'int(mtry)': 1,
                          'int(np.sqrt(len(self.dataframe.columns)))': 1,
                          'k': 2,
                          'm_f': 2,
                          'm_feat': 3,
                          'max_f': 3,
                          'max_feat_out': 5,
                          'max_feature': 1,
                          'max_features': 28,
                          'max_features_options': 1,
                          'mf': 4,
                          'min(49, len(result1.columns) - 1)': 1,
                          'min(52, len(result1.columns) - 1)': 1,
                          'min(64, len(result2.columns) - 1)': 1,
                          'mtry': 2,
                          'n_feat': 2,
                          'n_features': 1,
                          'p5': 1,
                          'param_max_features': 2,
                          "params['max_features']": 1,
                          "paras['rf'][2]": 3,
                          'rf_max_features': 7,
                          'rf_no_active_vars': 3,
                          'self.__max_features': 2,
                          'self.max_features': 3,
                          'self.max_features_forest': 1,
                          'settings["max_features"]': 1,
                          "space['max_features']": 3,
                          'total_features': 2,
                          'tree_features': 2,
                          'tunings[1] / 100': 8}),
             'max_leaf_nodes': defaultdict(int,
                         {'1000': 3,
                          '365': 14,
                          '50': 2,
                          'None': 26,
                          'feature': 1,
                          'int(tunings[4])': 1,
                          'max_leaf_nodes_options': 1,
                          'mln': 1,
                          'node_out': 1,
                          'param_max_leaf_nodes': 1,
                          "params['max_leaf_nodes']": 1,
                          'self.max_leaf_nodes': 3,
                          'self.max_leaf_nodes_forest': 1}),
             'minInstances': defaultdict(int, {'10': 1}),
             'min_impurity_split': defaultdict(int,
                         {'0.1': 1, '1e-07': 7, '1e-7': 1}),
             'min_samples_leaf': defaultdict(int,
                         {'1': 31,
                          '1.0': 1,
                          '10': 6,
                          '100': 1,
                          '1000': 1,
                          '15': 1,
                          '150': 1,
                          '2': 23,
                          '20': 8,
                          '200': 2,
                          '3': 6,
                          '365': 14,
                          '4': 3,
                          '5': 13,
                          '6': 1,
                          '8': 10,
                          '9': 1,
                          'args["min_samples_leaf"]': 1,
                          'best_param': 1,
                          "best_pars['msl']": 1,
                          'config[\n                                                "rf:min_samples_leaf"]': 1,
                          'context["classifiers"][classifier_name]["learning_algorithm"]["parameters"][\n                "min_samples_leaf"]': 1,
                          'individual[1]': 1,
                          'int(np.round(x[i]))': 2,
                          'int(settings["min_sample_leaf"])': 1,
                          'int(tunings[2])': 8,
                          'leaf_size': 2,
                          'm_s_l': 2,
                          'm_sam_leaf': 3,
                          'min_samples_at_leaf': 1,
                          'min_samples_leaf': 10,
                          'min_samples_leaf_options': 1,
                          'msl': 4,
                          'n': 8,
                          'nodes': 6,
                          'p7': 1,
                          'param': 1,
                          'param_min_samples_leaf': 2,
                          "params['min_samples_leaf']": 1,
                          'self.min_samples_leaf': 6,
                          'self.min_samples_leaf_forest': 1,
                          "space['msl']": 3,
                          'val': 1}),
             'min_samples_split': defaultdict(int,
                         {'0.02': 1,
                          '1': 80,
                          '10': 14,
                          '100': 10,
                          '1000': 1,
                          '12': 4,
                          '13': 2,
                          '15': 2,
                          '16': 3,
                          '163': 1,
                          '17': 5,
                          '2': 67,
                          '2*min_samples_at_leaf': 1,
                          '20': 1,
                          '25': 1,
                          '256': 1,
                          '3': 2,
                          '30': 1,
                          '32': 1,
                          '4': 15,
                          '5': 7,
                          '50': 5,
                          '7': 1,
                          '70': 2,
                          '76': 14,
                          '8': 4,
                          '9': 4,
                          'args["min_samples_split"]': 1,
                          "best_pars['mss']": 1,
                          'config[\n                                                "rf:min_samples_split"]': 1,
                          'context["classifiers"][classifier_name]["learning_algorithm"]["parameters"][\n                "min_samples_split"]': 1,
                          'feature': 1,
                          'individual[0]': 1,
                          'int(settings["min_sample_split"])': 1,
                          'int(tunings[3])': 8,
                          'len(self.x) / 8': 1,
                          'm_s_s': 2,
                          'min_sample': 1,
                          'min_samples': 4,
                          'min_samples_spl': 1,
                          'min_samples_split': 6,
                          'nodes*2': 6,
                          'p6': 1,
                          'param_min_samples_split': 2,
                          "params['min_samples_split']": 1,
                          'rf_min_sample_count': 3,
                          'sample_out': 3,
                          'self.min_samples_split': 4,
                          'self.min_samples_split_forest': 1,
                          "space['mss']": 3}),
             'min_weight_fraction_leaf': defaultdict(int,
                         {'0': 6,
                          '0.0': 15,
                          '0.1': 1,
                          '0.5': 1,
                          'feature': 1,
                          'frac_out': 2,
                          'int(settings["min_weight_faction_leaf"])': 1,
                          'min_weight_fraction_leaf': 2,
                          'mwfl': 1,
                          'param_min_weight_fraction_leaf': 1,
                          'self.min_weight_fraction_leaf': 3,
                          'self.min_weight_fraction_leaf_forest': 1}),
             'n_estimators': defaultdict(int,
                         {' n_estimators/2': 1,
                          " params['n_estimators']": 1,
                          ' pm.num_trees': 2,
                          ' self.RF_size': 1,
                          ' self.n_estimators': 1,
                          ' self.n_trees': 2,
                          ' self.ntrees': 1,
                          '0': 3,
                          '1': 20,
                          '10': 206,
                          '100': 435,
                          '1000': 69,
                          '10000': 9,
                          '101': 2,
                          '1024': 5,
                          '11': 3,
                          '12': 2,
                          '120': 4,
                          '1200': 4,
                          '12000': 1,
                          '128': 2,
                          '13': 2,
                          '1400': 2,
                          '15': 11,
                          '150': 27,
                          '1500': 3,
                          '15000': 1,
                          '17': 1,
                          '18': 2,
                          '180': 1,
                          '196': 1,
                          '198': 14,
                          '1999': 1,
                          '2': 3,
                          '20': 44,
                          '20*8': 1,
                          '200': 48,
                          '2000': 17,
                          '22': 1,
                          '23': 1,
                          '240': 2,
                          '25': 32,
                          '250': 10,
                          '2500': 3,
                          '256': 6,
                          '3': 3,
                          '30': 32,
                          '300': 38,
                          '3000': 6,
                          '30000': 1,
                          '32': 3,
                          '34': 1,
                          '35': 3,
                          '350': 1,
                          '4': 2,
                          '40': 20,
                          '400': 7,
                          '48': 1,
                          '5': 10,
                          '50': 97,
                          '500': 122,
                          '5000': 7,
                          '51': 3,
                          '512': 4,
                          '52': 2,
                          '55': 1,
                          '550': 1,
                          '6': 1,
                          '60': 4,
                          '600': 1,
                          '625': 1,
                          '64': 3,
                          '65': 1,
                          '7': 1,
                          '700': 2,
                          '75': 2,
                          '750': 3,
                          '8': 1,
                          '80': 6,
                          '800': 4,
                          '8000': 1,
                          '84': 1,
                          '850': 2,
                          '9': 1,
                          '90': 2,
                          '900': 1,
                          '91': 1,
                          '94': 3,
                          '95': 1,
                          '99': 3,
                          'C': 1,
                          'NEST': 1,
                          'R': 1,
                          'RFC_estimators': 6,
                          'RF_estimators': 1,
                          'RF_size': 3,
                          'args.ntrees': 1,
                          'args["num_trees"]': 1,
                          'args[0]': 1,
                          "best['n_estimators']": 2,
                          'best_n': 1,
                          'best_param_rf.get("n_estimators")': 2,
                          "best_pars['n_estimators']": 1,
                          'config["rf:n_estimators"]': 1,
                          'context["classifiers"][classifier_name]["learning_algorithm"]["parameters"]["n_estimators"]': 1,
                          'e': 1,
                          'est': 1,
                          'estimator': 4,
                          'estimator_param': 1,
                          'estimators': 5,
                          'feature': 1,
                          'i': 3,
                          'idx + 1': 2,
                          'individual[3]': 1,
                          'inner_estimators': 1,
                          'int(SILLY_NUMBER*1.5)': 1,
                          'int(len(MetricEntry.metrics)/3)': 1,
                          'int(numbtrees_param)': 1,
                          'int(settings["n_estimators"])': 1,
                          'int(tunings[0])': 8,
                          'lNbEstimatorsInEnsembles': 2,
                          'max_random_trees': 2,
                          'min_log_loss_iter': 1,
                          "model_param['n_estimators']": 1,
                          'mp.random_forest_estimators': 1,
                          'n': 15,
                          'n_cpu*trees_per_compute': 2,
                          'n_est': 11,
                          'n_estim': 5,
                          'n_estimator': 1,
                          'n_estimators': 73,
                          'n_estimators[0]': 1,
                          'n_estimators_options': 1,
                          'n_estimators_size': 2,
                          'n_out': 7,
                          'n_tree': 5,
                          'n_trees': 15,
                          'ne': 1,
                          'nest': 2,
                          'nr_of_trees': 1,
                          'nr_trees': 1,
                          'ntrees': 11,
                          'num': 1,
                          'numE': 1,
                          'numTrees': 2,
                          'num_estimators': 1,
                          'num_trees': 7,
                          'opts.estimators': 3,
                          'opts.numtrees': 4,
                          'p3': 1,
                          'param_n_estimators': 1,
                          "params['n_estimators']": 1,
                          "paras['rf'][1]": 3,
                          'rf_max_num_trees': 3,
                          'rf_n_estimators': 10,
                          'self.Nestimators': 1,
                          'self.__n_TreesInForest': 1,
                          'self.__n_estimators': 2,
                          "self._settings.get('trees', 10)": 1,
                          'self.config.hid_layer_units': 1,
                          'self.config.hid_layer_units_baseline': 1,
                          'self.n_estimators': 16,
                          'self.n_estimators_forest': 1,
                          'self.n_trees': 1,
                          'self.numTrees': 5,
                          "self.params['num_estimators']": 1,
                          'self.randomForestEstimators': 1,
                          "space['n']": 1,
                          "space['n_estimators']": 2,
                          'sqrt_feat_num': 1,
                          'trees': 9,
                          'val': 1}),
             'n_jobs': defaultdict(int,
                         {' -1': 31,
                          ' pm.n_jobs': 1,
                          ' self.n_jobs': 1,
                          '-1': 391,
                          '-2': 1,
                          '1': 55,
                          '10': 11,
                          '12': 8,
                          '15': 1,
                          '16': 4,
                          '2': 66,
                          '3': 8,
                          '4': 45,
                          '40': 1,
                          '5': 20,
                          '6': 3,
                          '7': 4,
                          '8': 14,
                          'NUM_THREADS': 2,
                          'PROCESSORS': 1,
                          'args.cpu': 1,
                          'args.njobs': 1,
                          'cores': 1,
                          'cpu_counts': 1,
                          'cpus': 2,
                          'int(settings["n_jobs"])': 1,
                          'jobs': 5,
                          'n_cores': 1,
                          'n_cpu': 2,
                          'n_estimators': 1,
                          'n_jobs': 29,
                          'njobs': 3,
                          'numJobs': 1,
                          'num_jobs': 2,
                          'number_of_threads': 1,
                          'options.n_jobs': 1,
                          'options.pyxit_n_jobs': 1,
                          'opts.nprocessors': 1,
                          'opts.numproc': 1,
                          'param_n_jobs': 2,
                          'self.n_jobs': 10,
                          'self.n_jobs_forest': 1,
                          'self.nthreads': 1,
                          'self.parallel_jobs': 1,
                          "self.params['num_jobs']": 1,
                          'self.threadCount': 1,
                          'workers': 3}),
             'numTrees': defaultdict(int, {'60': 1}),
             'oob_score': defaultdict(int,
                         {'1': 4,
                          'False': 34,
                          'True': 106,
                          'oob_score': 1,
                          'os': 1,
                          'param_oob_score': 2,
                          'self.oob_score_forest': 1}),
             'random_state': defaultdict(int,
                         {' self.ran_stat': 2,
                          '0': 217,
                          '1': 116,
                          '10': 2,
                          '1000 + l': 2,
                          '1000+l': 2,
                          '1104': 1,
                          '123': 13,
                          '1234': 1,
                          '12345': 4,
                          '125': 1,
                          '13': 4,
                          '1301': 1,
                          '131': 4,
                          '1337': 1,
                          '142': 1,
                          '144': 2,
                          '150': 1,
                          '17': 1,
                          '192': 3,
                          '1960': 3,
                          '2': 4,
                          '20': 3,
                          '2016': 1,
                          '21': 1,
                          '234': 1,
                          '241': 1,
                          '2543': 2,
                          '30': 4,
                          '32': 1,
                          '321': 1,
                          '324089': 2,
                          '33': 2,
                          '4': 8,
                          '4141': 1,
                          '42': 61,
                          '451': 1,
                          '5': 2,
                          '50': 8,
                          '571': 1,
                          '600': 1,
                          '7': 4,
                          '7112016': 8,
                          '77': 2,
                          '782629': 1,
                          '84': 3,
                          '87': 1,
                          '88': 1,
                          '93758': 1,
                          'None': 31,
                          'RANDOM_STATE': 7,
                          'RDM': 1,
                          'RND_SEED': 3,
                          'RandomState(__seed__)': 1,
                          'RandomState(seed)': 1,
                          'SEED': 1,
                          'args["seed"]': 1,
                          'choosen_random_state': 2,
                          'generator': 2,
                          'i': 6,
                          'n': 1,
                          'np.random.RandomState(0)': 1,
                          'param_random_state': 2,
                          'prng': 1,
                          'rand': 1,
                          'rand_state': 1,
                          'random': 9,
                          'random_seed': 1,
                          'random_state': 37,
                          'randomseedcounter': 4,
                          'rng': 5,
                          'seed': 7,
                          'self.random_state': 12,
                          'self.random_state_forest': 1,
                          'self.rng': 1,
                          'self.rs': 2,
                          'self.seed': 2,
                          'settings["random_state"]': 1}),
             'seed': defaultdict(int, {'1111': 1}),
             'verbose': defaultdict(int,
                         {'(\n                                                   2 if debug is True else 0)': 1,
                          '(2 if debug is True else 0)': 1,
                          '(args.loglevel == logging.DEBUG)': 1,
                          '0': 37,
                          '1': 29,
                          '10': 3,
                          '2': 30,
                          '20': 15,
                          '3': 6,
                          '42': 1,
                          'False': 3,
                          'True': 8,
                          'VERBOSE': 2,
                          'int(settings["verbose"])': 1,
                          'options.verbose': 1,
                          'param_verbose': 2,
                          'self.verbose_forest': 1,
                          'verbose': 19}),
             'warm_start': defaultdict(int,
                         {'False': 22,
                          'True': 20,
                          'param_warm_start': 2,
                          'self.warm_start_forest': 1,
                          'ws': 1})})
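
The keys in the output above are the raw argument expressions as written in the source files, so the same value can appear under several spellings: ' -1' and '-1' are counted separately under n_jobs, and '"auto"' and "'auto'" are counted separately under max_features. Below is a minimal sketch of one way to merge such duplicates before ranking the values. The helpers normalize_key and merge_counts are hypothetical names introduced here for illustration and are not part of Odyssey. The sample dictionaries are small excerpts copied from the output above, not the full result.

```python
from collections import Counter, defaultdict


def normalize_key(expr):
    """Strip surrounding whitespace and unify the quote style of a simple string literal.

    Hypothetical helper, not part of Odyssey.
    """
    expr = expr.strip()
    if len(expr) >= 2 and expr[0] == expr[-1] and expr[0] in "'\"":
        inner = expr[1:-1]
        # Only rewrite plain literals; leave expressions such as "a" + "b" alone.
        if "'" not in inner and '"' not in inner:
            expr = "'" + inner + "'"
    return expr


def merge_counts(raw_counts):
    """Sum the counts of all keys that normalize to the same expression."""
    merged = Counter()
    for expr, count in raw_counts.items():
        merged[normalize_key(expr)] += count
    return merged


# Excerpt of the 'n_jobs' entry from the output above.
n_jobs_sample = defaultdict(int, {' -1': 31, '-1': 391, '2': 66, '1': 55, '4': 45})
n_jobs_counts = merge_counts(n_jobs_sample)
print(n_jobs_counts.most_common(3))

# Excerpt of the 'max_features' entry from the output above.
max_features_sample = defaultdict(
    int, {'"auto"': 17, "'auto'": 64, '"sqrt"': 25, "'sqrt'": 17}
)
max_features_counts = merge_counts(max_features_sample)
print(max_features_counts.most_common())
```

On these excerpts, the two spellings of -1 merge into a single entry with 422 occurrences, and the two quote styles of 'auto' merge into one entry with 81. This is still plain string normalization: expressions that evaluate to the same value but are written differently, such as '1000 + l' and '1000+l' under random_state, stay separate.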