
Implemented Solutions to Problems 2 and 3 #15

Open · wants to merge 2 commits into master
Conversation

jeffbulmer:

Hello,

I have implemented solutions to the second and third parts of the clever-challenge. The second part was done in Go, using a simple text parser to parse the AST; the particular structure of this AST is exploited, though the solution should work for many other ASTs.
The third part of the challenge was solved using R to run some basic classification algorithms on the data, primarily for the purpose of variable selection. This was done in R Markdown, so a summary of the whole R segment can be seen as seq.pdf. Once variable selection had been carried out in R, Python was used for sequential analysis; this is implemented in seq.py in the seq folder.

# plt.plot(dataset.values[:, group])
# plt.title(dataset.columns[group], y=0.5, loc='right')
# i+=1
#plt.show
Contributor:

Why submit that much commented-out code?



#columns to be used decided from basic classification in R
dataset = pandas.read_csv('sample.csv', usecols=[1,2,4,5,6,7,8,9,10,11,12,13,15,16,18,20], engine='python')
Contributor:

In your R section, you removed columns -c(1,4,24,17,21,23,14,19,22,25,26,27,28,29,30,31) (it would have been much easier to read with the numbers in order).
This translates to 0, 3, 13, 16, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 in Python. Yet you used columns 13, 16, 18 and 20 in your model.
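As a side note, the 1-based-to-0-based translation can be checked mechanically; a quick sketch (the 31-column total is an assumption inferred from the largest index referenced in the R code):

```python
# 1-based column indices removed in the R code (sorted for readability)
r_removed = sorted([1, 4, 24, 17, 21, 23, 14, 19, 22, 25, 26, 27, 28, 29, 30, 31])

# Shift to Python's 0-based indexing
py_removed = [i - 1 for i in r_removed]
print(py_removed)
# → [0, 3, 13, 16, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]

# Assuming the CSV has 31 columns, the columns that should remain are:
py_kept = [i for i in range(31) if i not in py_removed]
print(py_kept)
# → [1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 19]
```

Note how the kept set contains 14, 15, 17 and 19 rather than the 13, 15, 16, 18 and 20 that appear in the usecols list.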

For comparison, we'll run a logistic regression using only variables above the threshold

```{r}
simdat <- sample[,-c(1,4,24,17,21,23,14,19,22,25,26,27,28,29,30,31)]
```
Contributor:

The first column is an ID, so of course it's not relevant by itself, but it can be used to link the information here with the one in the res.csv file, which you neither used nor seem to have looked at. Any reason?

#columns to be used decided from basic classification in R
dataset = pandas.read_csv('sample.csv', usecols=[1,2,4,5,6,7,8,9,10,11,12,13,15,16,18,20], engine='python')
dataset_norm = (dataset - dataset.mean()) / (dataset.max() - dataset.min())
dataset_norm["class"] = dataset["class"]
Contributor:

You normalized the class then set it back to its original state, which makes total sense, but you kept the timestamp feature normalized. Why is that?

Since it's a time series, all future timestamps will only go further and further away from your mean.
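To make the drift concrete, here is a minimal sketch (hypothetical timestamps, not the PR's data) of mean normalization fitted on a training window and then applied to a later timestamp:

```python
# Hypothetical training timestamps (plain Python, no dependencies)
train_ts = list(range(100))
mean = sum(train_ts) / len(train_ts)      # 49.5
spread = max(train_ts) - min(train_ts)    # 99

def normalize(t):
    # Mean normalization fitted on the training window
    return (t - mean) / spread

print(normalize(99))    # 0.5 — inside the training range
print(normalize(1000))  # ~9.6 — a future timestamp lands far outside it
```

Every later timestamp maps further from anything the model saw during training.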

return agg

scaler = MinMaxScaler(feature_range=(0,1))
scaled = scaler.fit_transform(dataset_norm.values.astype('float32'))
Contributor:

Why scale them again? And/or why scale them relative to their mean in the first place?
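For what it's worth, min-max scaling is invariant to the earlier mean-normalization pass (an affine shift plus a positive rescale), so that first step is redundant; a small check with made-up numbers:

```python
import numpy as np

x = np.array([3.0, 7.0, 11.0])  # made-up feature values

# First pass, as in the PR: mean normalization
x_norm = (x - x.mean()) / (x.max() - x.min())

# Second pass: min-max scaling to [0, 1]
x_twice = (x_norm - x_norm.min()) / (x_norm.max() - x_norm.min())

# Min-max applied directly to the raw values gives the same result
x_once = (x - x.min()) / (x.max() - x.min())

print(np.allclose(x_twice, x_once))  # True
```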

#aggregate
agg = pandas.concat(cols, axis=1)
agg.columns = names
#drop NaNs
Contributor:

I get what you wanted to do, but as a side note, this can be done using join in pandas which would make the whole function easier to read.
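For illustration, a sketch of that join-based version, simplified to a single lag (the function name and column-suffix convention are just one choice):

```python
import pandas as pd

def lagged_supervised(df, lag=1, dropnan=True):
    # shift() builds the lagged copy; join() aligns it on the index,
    # replacing the manual concat/column-renaming bookkeeping.
    past = df.shift(lag).add_suffix(f'(t-{lag})')
    present = df.add_suffix('(t)')
    agg = past.join(present)
    return agg.dropna() if dropnan else agg

df = pd.DataFrame({'var1': [1, 2, 3], 'var2': [4, 5, 6]})
res = lagged_supervised(df)
print(list(res.columns))          # ['var1(t-1)', 'var2(t-1)', 'var1(t)', 'var2(t)']
print(res['var1(t-1)'].tolist())  # [1.0, 2.0]
```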

scaled = scaler.fit_transform(dataset_norm.values.astype('float32'))
scaled[:,1] = scaled[:,1].astype('int')
reframed = series_to_supervised(scaled, 1, 1)
reframed.drop(reframed.columns[[16,18,19,20,21,22,23,24,25,26,27,28,29,30,31]], axis=1, inplace=True)
Contributor:

You added a bunch of columns only to remove them all except one? You lost me there.

#split into inputs and outputs
train_X, train_y = train[:,:-1], train[:,-1]
test_X, test_y = test[:,:-1], test[:,-1]
#reshape for 3D input
Contributor:

Since you're essentially feeding only a 2D dataset, I don't see why you reshape it into a 3D matrix. Anyway, all roads lead to Rome with Keras.
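For context, the reshape matches what Keras recurrent layers expect as input, (samples, timesteps, features), even when there is only one timestep; a numpy-only sketch with a made-up shape:

```python
import numpy as np

train_X = np.zeros((100, 16), dtype='float32')  # hypothetical 2D data

# Keras LSTM layers expect (samples, timesteps, features);
# with a single timestep the third axis is simply inserted:
train_X_3d = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
print(train_X_3d.shape)  # (100, 1, 16)
```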

print('Evaluated Accuracy of Model: %.3f' % (scores[1]*100) )



Contributor:

In a nutshell, you treated all this like a classic classification problem, only adding the last event's state into your dataset.
In fact, in the present state of the system, you must predict the state of future events without knowing the features of the next events.

In the final phase, you create new events and try to predict their state. There are some issues with how you create them, but this is not what was asked.




Xnew, _ = make_blobs(n_samples=1, centers=3, n_features=train_X.shape[2], random_state=1)
Contributor:

Two things here:

  1. make_blobs creates Gaussian samples, while your model is scaled for values in [0, 1].
  2. You scaled your timestamp, therefore any new event cannot have a timestamp value < 1.

Again, this is irrelevant since it's not in line with the question.
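On point 1, make_blobs draws its cluster centers uniformly from (-10, 10) by default, so the samples it returns are nowhere near the [0, 1] range the model was trained on; this is easy to verify (the feature count below is illustrative):

```python
from sklearn.datasets import make_blobs

# Same call shape as in the PR, with an illustrative feature count
Xnew, _ = make_blobs(n_samples=5, centers=3, n_features=16, random_state=1)

# Values fall well outside [0, 1]
print(Xnew.min(), Xnew.max())
print(((Xnew < 0) | (Xnew > 1)).any())  # True
```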
