Implemented Solutions to Problems 2 and 3 #15
base: master
Conversation
# plt.plot(dataset.values[:, group])
# plt.title(dataset.columns[group], y=0.5, loc='right')
# i+=1
#plt.show
Why submit that much commented-out code?
#columns to be used decided from basic classification in R
dataset = pandas.read_csv('sample.csv', usecols=[1,2,4,5,6,7,8,9,10,11,12,13,15,16,18,20], engine='python')
In your R section, you removed columns -c(1,4,24,17,21,23,14,19,22,25,26,27,28,29,30,31)
(it would have been much easier to read if the numbers were in order).
This translates to 0,3,13,16,18,20,21,22,23,24,25,26,27,28,29,30
in Python. Yet, you used columns 13, 16, 18 and 20 in your model.
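To make the index arithmetic concrete, here is a small sketch (assuming the CSV has 31 columns, as the R drop list suggests) that derives the 0-based columns Python should keep from the 1-based columns R dropped:

```python
# 1-based columns dropped in the R section
r_dropped = [1, 4, 24, 17, 21, 23, 14, 19, 22, 25, 26, 27, 28, 29, 30, 31]
n_cols = 31  # assumed total column count

# R indexes from 1, pandas usecols from 0
py_dropped = {i - 1 for i in r_dropped}
py_kept = [i for i in range(n_cols) if i not in py_dropped]
print(py_kept)  # [1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 19]
```

Comparing that output with the usecols above makes the discrepancy (13, 16, 18 and 20) immediately visible.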
For comparison, we'll run a logistic regression using only variables above the threshold
```{r}
simdat <- sample[,-c(1,4,24,17,21,23,14,19,22,25,26,27,28,29,30,31)]
```
The first column is an ID, so of course it's not relevant by itself, but it can be used to link the information here with the one in the res.csv file, which you neither used nor seem to have looked at. Any reason?
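For illustration only (column names are assumed, since res.csv isn't shown in this diff), the link could be as simple as a pandas merge on that ID:

```python
import pandas

sample = pandas.read_csv('sample.csv')
res = pandas.read_csv('res.csv')

# join on the first column of each file, assumed to be the shared event ID
linked = sample.merge(res, left_on=sample.columns[0], right_on=res.columns[0])
```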
#columns to be used decided from basic classification in R
dataset = pandas.read_csv('sample.csv', usecols=[1,2,4,5,6,7,8,9,10,11,12,13,15,16,18,20], engine='python')
dataset_norm = (dataset - dataset.mean()) / (dataset.max() - dataset.min())
dataset_norm["class"] = dataset["class"]
You normalized the class then set it back to its original state, which makes total sense, but you kept the timestamp feature normalized. Why is that?
Since it's a time series, all future timestamps will only move further and further away from your mean.
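One way to avoid that, sketched below with an assumed 'timestamp' column name (the actual name isn't visible in the diff), is to leave the time column out of the normalization entirely:

```python
# normalize every feature except the timestamp and the class
feature_cols = [c for c in dataset.columns if c not in ('timestamp', 'class')]
dataset_norm = dataset.copy()
dataset_norm[feature_cols] = (dataset[feature_cols] - dataset[feature_cols].mean()) \
    / (dataset[feature_cols].max() - dataset[feature_cols].min())
```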
return agg

scaler = MinMaxScaler(feature_range=(0,1))
scaled = scaler.fit_transform(dataset_norm.values.astype('float32'))
Why scale them again? And/or why scale them relative to their mean in the first place?
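If min-max scaling is the goal, a single pass over the raw values would do the same job; something like:

```python
from sklearn.preprocessing import MinMaxScaler

# one scaling pass instead of mean-normalizing and then rescaling the result
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(dataset.values.astype('float32'))
```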
#aggregate
agg = pandas.concat(cols, axis=1)
agg.columns = names
#drop NaNs
I get what you wanted to do, but as a side note, this can be done using join in pandas, which would make the whole function easier to read.
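A minimal sketch of what I mean, for the n_in=1, n_out=1 case used here (assuming the input is a DataFrame rather than the raw array the original passes):

```python
def series_to_supervised(df):
    lagged = df.shift(1).add_suffix(' (t-1)')  # every feature one step back
    current = df.add_suffix(' (t)')            # every feature at time t
    return lagged.join(current).dropna()       # align on the index, drop the NaN row
```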
scaled = scaler.fit_transform(dataset_norm.values.astype('float32'))
scaled[:,1] = scaled[:,1].astype('int')
reframed = series_to_supervised(scaled, 1, 1)
reframed.drop(reframed.columns[[16,18,19,20,21,22,23,24,25,26,27,28,29,30,31]], axis=1, inplace=True)
You added a bunch of columns only to remove them all except one? You lost me there.
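If the intent was to keep the 16 lagged features plus the current class, selecting those columns directly would say so explicitly; a sketch, assuming column 17 of the reframed frame is the class at time t:

```python
n_features = 16
target_col = 17  # assumed: the class feature at time t
reframed = reframed.iloc[:, list(range(n_features)) + [target_col]]
```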
#split into inputs and outputs
train_X, train_y = train[:,:-1], train[:,-1]
test_X, test_y = test[:,:-1], test[:,-1]
#reshape for 3D input
Since you're essentially feeding only a 2D dataset, I don't see why you reshape it into a 3D matrix. Anyway, all roads lead to Rome with Keras.
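For readers following along, the reshape in question is presumably the standard Keras LSTM input layout of (samples, timesteps, features) with a single timestep:

```python
# add a dummy timesteps axis so the data fits an LSTM input layer
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
```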
print('Evaluated Accuracy of Model: %.3f' % (scores[1]*100) )
In a nutshell, you treated all this like a classic classification system, only adding the last event's state into your dataset.
In fact, with the present state of the system, you must predict the state of future events without knowing the features of those next events.
In the final phases afterwards, you create new events and try to predict their state. There are some issues with how you create them, but this is not what was asked.
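A minimal sketch of the framing being asked for, assuming a 'class' column as above: the target is the next event's state, and only the current event's features are used as input.

```python
# features of event t predict the state of event t+1;
# the features of event t+1 are never seen by the model
X = dataset_norm.drop(columns=['class']).values[:-1]
y = dataset_norm['class'].shift(-1).dropna().values
```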
Xnew, _ = make_blobs(n_samples=1, centers=3, n_features=train_X.shape[2], random_state=1)
Two things here:
- make_blobs creates Gaussian samples, while your model is scaled for values between [0, 1].
- You scaled your timestamp, therefore any new event cannot have a timestamp value < 1.

Again, this is irrelevant since it's not in line with the question.
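A hypothetical alternative that at least respects the scaling (the timestamp column index is assumed to be 0 here):

```python
import numpy as np

rng = np.random.default_rng(1)
# sample new feature values inside the [0, 1] range the model was fitted on
Xnew = rng.uniform(0.0, 1.0, size=(1, train_X.shape[2]))
# a genuinely new event arrives after the training data, so its scaled
# timestamp must sit at or beyond 1
Xnew[0, 0] = 1.0
Xnew = Xnew.reshape((1, 1, train_X.shape[2]))
```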
Hello,
I have implemented solutions to the second and third parts of the clever-challenge. The second part was done in golang, using a simple text parser to parse the AST. The particular structure of this AST is exploited, though the solution should work for many other ASTs.
The third part of the challenge was solved using R to run some basic classification algorithms on the data, primarily for the purpose of variable selection. This was done in R Markdown, so a summary of the whole R segment can be seen in seq.pdf. Once variable selection had been carried out in R, Python was used for sequential analysis. This is implemented in seq.py in the seq folder.