Many approximation algorithms converge better when the data is normalized (scaled to the range zero to one) and centered (mean == 0). Could you please investigate whether this is a possible issue that would be interesting to show?
If positive, we can easily run the algorithms with four different subsets of the utility-increase data, combining small and large kurtosis and skewness. We would be looking at how quickly (number of episodes) each run achieves a certain level of exploitation (reduces exploration) and how quickly it reaches the maximum reward (within a determined margin of error). These are charts that Nico has already developed.
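As a rough starting point, splitting the utility-increase data by skewness and kurtosis could look like the sketch below (Python with pandas/scipy; the file name, the column names and the median-split thresholds are assumptions, not taken from the actual data):

```python
# Sketch only: "utility_increases.csv", the column names and the median split
# are placeholders/assumptions, not taken from the real data set.
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

df = pd.read_csv("utility_increases.csv")

# Distribution shape per <component, failure> combination.
stats = (
    df.groupby(["component", "failure"])["utility_increase"]
      .agg(skewness=skew, kurt=kurtosis)
      .reset_index()
)

# Four subsets: {small, large} skewness x {small, large} kurtosis via a median split.
stats["skew_group"] = np.where(stats["skewness"] > stats["skewness"].median(), "large", "small")
stats["kurt_group"] = np.where(stats["kurt"] > stats["kurt"].median(), "large", "small")

print(stats.groupby(["skew_group", "kurt_group"]).size())
```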
I thought about this one again. Normalization is used in ML algorithms to equalize the impact of different predictor variables on the target variable. However, we only have a single input variable, the reward. Therefore, normalization will probably have no effect, because no part of the current RL algorithms is sensitive to the absolute size of the rewards.
Regarding the transformation of the raw utilities: I don't think this is useful either, because we want the agent to maximize the total utility/reward

r_1 + r_2 + ... + r_n

However, if we use some function f to alter all the rewards r_1, ..., r_n, the agent instead maximizes

f(r_1) + f(r_2) + ... + f(r_n)

which in general is not maximized by the same behavior unless f is a positive linear transformation.
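A tiny illustration of that point (the reward values are made up, just to show that a nonlinear transform can flip which total is larger):

```python
# Two hypothetical reward sequences produced by two different behaviours.
rewards_a = [3, 3]   # steady, moderate utility increases
rewards_b = [5, 0]   # one large increase, then nothing

f = lambda r: r ** 2  # example of a nonlinear (but monotone on r >= 0) transform

print(sum(rewards_a), sum(rewards_b))
# 6 vs 5 -> behaviour A is better on the raw rewards
print(sum(f(r) for r in rewards_a), sum(f(r) for r in rewards_b))
# 18 vs 25 -> behaviour B looks better after the transform
```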
Therefore, I propose we drop this part about modifying the input data. An interesting task, however, would be to analyze the data for possible faults/interesting characteristics, which will help us understand the results later on.
Well, I did some research about this too. Currently our aim is to better predict / distinguish between the <component, failure> combinations. Some important points about normalization:
- the type of distribution doesn't change (a quick numerical check of this is sketched after the list)
- it allows the agent to distinguish good and bad actions more effectively
- it reduces training time
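For the first point, here is a small sanity check (with made-up, skewed stand-in data) showing that min-max scaling and z-scoring are linear transforms and therefore leave skewness and kurtosis, i.e. the shape of the distribution, unchanged:

```python
# Min-max normalization and standardization are positive affine transforms,
# so they do not change skewness or (excess) kurtosis. Sample data is made up.
import numpy as np
from scipy.stats import skew, kurtosis

rewards = np.random.gamma(shape=2.0, scale=3.0, size=10_000)  # skewed stand-in reward data

minmax = (rewards - rewards.min()) / (rewards.max() - rewards.min())  # scaled to [0, 1]
zscore = (rewards - rewards.mean()) / rewards.std()                   # mean 0, std 1

for name, x in [("raw", rewards), ("min-max", minmax), ("z-score", zscore)]:
    print(f"{name:8s} skew={skew(x):.3f}  kurtosis={kurtosis(x):.3f}")
```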
Then I read a little bit more about it, and normalization could also help to deal with a non-stationary environment: since in reinforcement learning the behavior policy can change during learning, the distribution and magnitude of the values change as well. An approach to deal with that is presented here: https://arxiv.org/pdf/1602.07714.pdf
Currently I don't have a proven approach for how to implement normalisation (since it isn't always just about scaling to [-1; 1]), but I would prefer not to drop this idea right now.
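One simple option in that direction would be online normalization with running statistics. The sketch below is just that baseline, not the adaptive target-normalization (Pop-Art) method from the linked paper, and all names in it are placeholders:

```python
# Minimal sketch: normalize rewards with running mean/std so their scale stays
# comparable even if the reward distribution drifts during learning.
# This is a simple baseline, not the algorithm from arXiv:1602.07714.
import math


class RunningRewardNormalizer:
    """Tracks mean/variance of observed rewards with Welford's online algorithm."""

    def __init__(self, epsilon: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0           # sum of squared deviations from the running mean
        self.epsilon = epsilon  # avoids division by zero early on

    def update(self, reward: float) -> None:
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    def normalize(self, reward: float) -> float:
        std = math.sqrt(self.m2 / max(self.count - 1, 1))
        return (reward - self.mean) / (std + self.epsilon)


# Hypothetical usage inside a training loop:
# normalizer = RunningRewardNormalizer()
# normalizer.update(reward)
# agent.learn(state, action, normalizer.normalize(reward), next_state)
```

Note that this centering/scaling is itself a (time-varying) affine transformation of the rewards, so it ties back to the discussion above about which transformations leave the objective intact.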
@brrrachel and @2start (Nico)