Our use case is to run C2Numpy inside ROOT and then tackle a particle classification problem using XGBoost and pandas.
Our data set has 6 kinematic variables (jet pt, dilepton mass, etc.), plus a label set telling us, from Monte Carlo, whether the particle comes from an interesting process or not. We also have a weight array, because events that come from Monte Carlo need to be weighted.
Assuming we have prepared these as .npy files, we can load them and put the data into a pandas.DataFrame:
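import glob
import numpy as np
import pandas as pd
# collect the feature and label files in a matching, sorted order
xfiles = sorted(glob.glob("./xdata_*.npy"))
yfiles = sorted(glob.glob("./ydata_*.npy"))
# load every file and stack the pieces into one array each
xarrays = [np.load(f) for f in xfiles]
rawdata = np.concatenate(xarrays)
yarrays = [np.load(f) for f in yfiles]
rawydata = np.concatenate(yarrays)
# if the arrays are structured (named fields), the field names become column names
dfx = pd.DataFrame(rawdata)
dfy = pd.DataFrame(rawydata)
setsize = rawydata.shape[0]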
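As a small illustration of the loading step, the sketch below builds two tiny structured arrays in memory instead of reading .npy files and shows how their field names carry over to DataFrame columns. The field names (`jet_pt`, `dilepton_mass`, `weight`) and the numbers are made up for the example; the real files define their own layout.

```python
import numpy as np
import pandas as pd

# Hypothetical field names and values, standing in for the contents of the .npy files.
dtype = [("jet_pt", "f8"), ("dilepton_mass", "f8"), ("weight", "f8")]
chunk_a = np.array([(35.2, 91.0, 1.0), (48.7, 88.5, -0.5)], dtype=dtype)
chunk_b = np.array([(27.9, 93.4, 0.8)], dtype=dtype)

# Stacking structured arrays with the same dtype keeps the named fields.
raw = np.concatenate([chunk_a, chunk_b])

# pandas turns each named field into a column.
df = pd.DataFrame(raw)
columns = list(df.columns)
n_rows = len(df)
print(columns, n_rows)
```

This is what makes a lookup such as `dfx['weight']` possible later on, provided the stored arrays actually have a field with that name.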
Then we want to shuffle the data. Rather than shuffling each frame separately and relying on np.random.seed to keep them in sync, generate one permutation of the data set length first and apply it to both frames.
perm = np.random.permutation(setsize)
dfx = dfx.iloc[perm]
dfy = dfy.iloc[perm]
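A quick way to convince yourself that one shared permutation keeps features and labels aligned is to try it on toy data where the pairing is obvious. This is only a sketch: the column names `x` and `y` and the seed are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)  # arbitrary seed, only so this sketch is repeatable

n = 10
# Toy frames where row i of the features is always 10 times row i of the labels.
features = pd.DataFrame({"x": np.arange(n) * 10.0})
labels = pd.DataFrame({"y": np.arange(n)})

# One permutation, applied to both frames.
perm = np.random.permutation(n)
features_shuffled = features.iloc[perm]
labels_shuffled = labels.iloc[perm]

# After shuffling, every feature row should still match its label row.
aligned = bool(
    (features_shuffled["x"].values == labels_shuffled["y"].values * 10.0).all()
)
print(aligned)
```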
And then you may need to pull out some columns (for us, the weight column) for separate use in XGBoost:
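# pull the event weights out of the feature frame
weight = dfx['weight']
dfx = dfx.drop('weight', axis=1)
# 0/1 labels in the same shuffled order (assumes the label is the first column of dfy)
binaryy = dfy.iloc[:, 0].values
# separate into train (first 70%) and test (remaining 30%) sets
ntrain = int(setsize * 0.7)
weight_train = weight.iloc[:ntrain]
weight_test = weight.iloc[ntrain:]
trainx = dfx.iloc[:ntrain]
testx = dfx.iloc[ntrain:]
trainy = binaryy[:ntrain]
testy = binaryy[ntrain:]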
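One way to make sure every row lands in exactly one of the two sets is to cut at a single index instead of computing the two sizes independently. The helper below, `split_at_fraction`, is a hypothetical name introduced for this sketch, and the toy frame is made-up data.

```python
import numpy as np
import pandas as pd

def split_at_fraction(frame, fraction=0.7):
    """Return (first part, rest) of a frame, cut at one shared index."""
    n_train = int(len(frame) * fraction)
    return frame.iloc[:n_train], frame.iloc[n_train:]

# Toy frame with 11 rows, a size that does not divide evenly into 70/30.
toy = pd.DataFrame({"pt": np.linspace(20.0, 120.0, 11)})
train_part, test_part = split_at_fraction(toy, 0.7)

print(len(train_part), len(test_part))
```

With 11 rows, rounding 70% and 30% down separately would give 7 and 3 rows and silently leave one row unused; a single cut point gives 7 and 4.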
And then you're basically good to go!
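import xgboost as xgb
# the absolute value is taken of the Monte Carlo weights before they are passed to XGBoost
dtrain = xgb.DMatrix(trainx.values, label=trainy, weight=np.abs(weight_train.values))
dtest = xgb.DMatrix(testx.values, label=testy, weight=np.abs(weight_test.values))
# early stopping watches the LAST entry of this list, so the held-out set goes last
evallist = [(dtrain, 'train'), (dtest, 'eval')]
num_round = 700
param = {}
param['objective'] = 'binary:logistic'
param['eta'] = 0.05
param['max_depth'] = 4
param['verbosity'] = 0  # replaces the deprecated 'silent' flag in recent XGBoost releases
param['nthread'] = 12
param['eval_metric'] = "auc"
param['subsample'] = 0.6
param['colsample_bytree'] = 0.5
# xgb.train accepts the parameter dict directly
bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=200)
bst.save_model('./001.model')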