
Implemented Solutions to Problems 2 and 3 #15

Open · wants to merge 2 commits into master
Conversation

jeffbulmer:

Hello,

I have implemented solutions to the second and third parts of the clever-challenge. The second part was done in Go, using a simple text parser to parse the AST; the particular structure of this AST is exploited, though the solution should work for many other ASTs.
The third part of the challenge was solved using R to run some basic classification algorithms on the data, primarily for the purpose of variable selection. This was done in R Markdown, so a summary of the whole R segment can be seen as seq.pdf. Once variable selection had been carried out in R, Python was used for sequential analysis; this is implemented in seq.py in the seq folder.

# plt.plot(dataset.values[:, group])
# plt.title(dataset.columns[group], y=0.5, loc='right')
# i+=1
#plt.show
Contributor:

Why submit that much commented-out code?



#columns to be used decided from basic classification in R
dataset = pandas.read_csv('sample.csv', usecols=[1,2,4,5,6,7,8,9,10,11,12,13,15,16,18,20], engine='python')
Contributor:

In your R section, you removed columns -c(1,4,24,17,21,23,14,19,22,25,26,27,28,29,30,31) (it would have been much easier to read with the numbers in order).
This translates to 0, 3, 13, 16, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 in Python. Yet you used columns 13, 16, 18 and 20 in your model.
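As a side note, the 1-based-to-0-based translation can be checked mechanically; a quick sketch (the 31-column total is an assumption inferred from the largest index referenced in the R code):

```python
# 1-based column indices removed in the R code (sorted for readability)
r_removed = sorted([1, 4, 24, 17, 21, 23, 14, 19, 22, 25, 26, 27, 28, 29, 30, 31])

# Shift to Python's 0-based indexing
py_removed = [i - 1 for i in r_removed]
print(py_removed)
# → [0, 3, 13, 16, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]

# Assuming the CSV has 31 columns, the columns that should remain are:
py_kept = [i for i in range(31) if i not in py_removed]
print(py_kept)
# → [1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 19]
```

Note how the kept set contains 14, 15, 17 and 19 rather than the 13, 15, 16, 18 and 20 that appear in the usecols list.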

For comparison, we'll run a logistic regression using only variables above the threshold

```{r}
simdat <- sample[,-c(1,4,24,17,21,23,14,19,22,25,26,27,28,29,30,31)]
```
Contributor:

The first column is an ID, so of course it's not relevant by itself, but it can be used to link the information here with the one in the res.csv file, which you neither used nor seem to have looked at. Any reason?

#columns to be used decided from basic classification in R
dataset = pandas.read_csv('sample.csv', usecols=[1,2,4,5,6,7,8,9,10,11,12,13,15,16,18,20], engine='python')
dataset_norm = (dataset - dataset.mean()) / (dataset.max() - dataset.min())
dataset_norm["class"] = dataset["class"]
Contributor:

You normalized the class then set it back to its original state, which makes total sense, but you kept the timestamp feature normalized. Why is that?

Since it's a time series, all future timestamps will only go further and further away from your mean.
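To make the drift concrete, here is a minimal sketch (hypothetical timestamps, not the PR's data) of mean normalization fitted on a training window and then applied to a later timestamp:

```python
# Hypothetical training timestamps (plain Python, no dependencies)
train_ts = list(range(100))
mean = sum(train_ts) / len(train_ts)      # 49.5
spread = max(train_ts) - min(train_ts)    # 99

def normalize(t):
    # Mean normalization fitted on the training window
    return (t - mean) / spread

print(normalize(99))    # 0.5 — inside the training range
print(normalize(1000))  # ~9.6 — a future timestamp lands far outside it
```

Every later timestamp maps further from anything the model saw during training.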

return agg

scaler = MinMaxScaler(feature_range=(0,1))
scaled = scaler.fit_transform(dataset_norm.values.astype('float32'))
Contributor:

Why scale them again? And/or why scale them relative to their mean in the first place?
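For what it's worth, min-max scaling is invariant to the earlier mean-normalization pass (an affine shift plus a positive rescale), so that first step is redundant; a small check with made-up numbers:

```python
import numpy as np

x = np.array([3.0, 7.0, 11.0])  # made-up feature values

# First pass, as in the PR: mean normalization
x_norm = (x - x.mean()) / (x.max() - x.min())

# Second pass: min-max scaling to [0, 1]
x_twice = (x_norm - x_norm.min()) / (x_norm.max() - x_norm.min())

# Min-max applied directly to the raw values gives the same result
x_once = (x - x.min()) / (x.max() - x.min())

print(np.allclose(x_twice, x_once))  # True
```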

#aggregate
agg = pandas.concat(cols, axis=1)
agg.columns = names
#drop NaNs
Contributor:

I get what you wanted to do, but as a side note, this can be done using join in pandas which would make the whole function easier to read.
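For illustration, a sketch of that join-based version, simplified to a single lag (the function name and column-suffix convention are just one choice):

```python
import pandas as pd

def lagged_supervised(df, lag=1, dropnan=True):
    # shift() builds the lagged copy; join() aligns it on the index,
    # replacing the manual concat/column-renaming bookkeeping.
    past = df.shift(lag).add_suffix(f'(t-{lag})')
    present = df.add_suffix('(t)')
    agg = past.join(present)
    return agg.dropna() if dropnan else agg

df = pd.DataFrame({'var1': [1, 2, 3], 'var2': [4, 5, 6]})
res = lagged_supervised(df)
print(list(res.columns))          # ['var1(t-1)', 'var2(t-1)', 'var1(t)', 'var2(t)']
print(res['var1(t-1)'].tolist())  # [1.0, 2.0]
```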

scaled = scaler.fit_transform(dataset_norm.values.astype('float32'))
scaled[:,1] = scaled[:,1].astype('int')
reframed = series_to_supervised(scaled, 1, 1)
reframed.drop(reframed.columns[[16,18,19,20,21,22,23,24,25,26,27,28,29,30,31]], axis=1, inplace=True)
Contributor:

You added a bunch of columns only to remove them all except one? You lost me there.

#split into inputs and outputs
train_X, train_y = train[:,:-1], train[:,-1]
test_X, test_y = test[:,:-1], test[:,-1]
#reshape for 3D input
Contributor:

Since you're essentially feeding only a 2D dataset, I don't see why you reshape it into a 3D matrix. Anyway, all roads lead to Rome with Keras.
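For context, the reshape matches what Keras recurrent layers expect as input, (samples, timesteps, features), even when there is only one timestep; a numpy-only sketch with a made-up shape:

```python
import numpy as np

train_X = np.zeros((100, 16), dtype='float32')  # hypothetical 2D data

# Keras LSTM layers expect (samples, timesteps, features);
# with a single timestep the third axis is simply inserted:
train_X_3d = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
print(train_X_3d.shape)  # (100, 1, 16)
```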

print('Evaluated Accuracy of Model: %.3f' % (scores[1]*100) )



Contributor:

In a nutshell, you treated all this like a classic classification problem, only adding the last event's state into your dataset.
In fact, in the present state of the system, you must predict the state of future events without knowing the features of the next events.

In the final phase, you create new events and try to predict their state. There are some issues with how you create them, but this is not what was asked.




Xnew, _ = make_blobs(n_samples=1, centers=3, n_features=train_X.shape[2], random_state=1)
Contributor:

Two things here:

  1. make_blobs creates Gaussian samples, while your model is scaled for values in [0, 1].
  2. You scaled your timestamp, therefore any new event cannot have a timestamp value < 1.

Again, this is irrelevant since it's not in line with the question.
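On point 1, make_blobs draws its cluster centers uniformly from (-10, 10) by default, so the samples it returns are nowhere near the [0, 1] range the model was trained on; this is easy to verify (the feature count below is illustrative):

```python
from sklearn.datasets import make_blobs

# Same call shape as in the PR, with an illustrative feature count
Xnew, _ = make_blobs(n_samples=5, centers=3, n_features=16, random_state=1)

# Values fall well outside [0, 1]
print(Xnew.min(), Xnew.max())
print(((Xnew < 0) | (Xnew > 1)).any())  # True
```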
