How can I override the feature semantic for multi-dimensional features? #145

Open
NashShuai opened this issue Nov 9, 2024 · 2 comments

@NashShuai

I am trying to run a random forest (RF) regression on my dataset. My dataframe looks like this (only the first 5 rows are shown):

              target                                          feat_1  \
county  year                                                                 
County_1 2000   0.047879  [0, 2, 10, 9, 12, 10, 9, 20, 35, 51, 0, 0, 0, ...   
        2001  -0.112184  [0, 1, 0, 2, 1, 2, 4, 9, 18, 34, 0, 1, 0, 1, 1...   
        2002   0.060659  [0, 0, 0, 0, 3, 24, 33, 32, 42, 58, 0, 0, 0, 2...   
        2003   0.098047  [0, 0, 1, 5, 13, 22, 40, 38, 29, 42, 0, 0, 0, ...   
        2004  -0.053559  [0, 1, 0, 2, 6, 8, 14, 33, 34, 64, 0, 0, 1, 1,...   

                                                      feat_2  
county  year                                                     
County_1 2000  [1.8121698113207556, 0.938584905660378, -0.568...  
        2001  [2.6941509433962274, 3.888301886792455, 2.8169...  
        2002  [-3.4043396226415084, -3.458113207547169, -3.5...  
        2003  [-1.9566037735849044, -2.3393396226415084, -2....  
        2004  [-3.2046226415094323, -3.502075471698112, -2.9...  

When running the code to do the regression using only feat_1, the code works perfectly fine.

import ydf

FEATURES_IN = ['feat_1']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

[This works well; I can later use model.describe() to inspect the model.]

However, when I include feat_2 in the regression,

FEATURES_IN = ['feat_1', 'feat_2']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

It raises the error:

ValueError: Cannot import column 'feat_2' with semantic=Semantic.CATEGORICAL_SET as it contains floating point values.
Note: If the column is a label, make sure the correct task is selected. For example, you cannot train a classification model (task=ydf.Task.CLASSIFICATION) with floating point labels.

In this case, I am not sure how to override the feature semantic for multi-dimensional features. I could not find it in your documentation. I tried to use

FEATURES_IN = [
    ydf.Feature("feat_1", ydf.Semantic.NUMERICAL),
    ydf.Feature("feat_2", ydf.Semantic.NUMERICAL),
]
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

but it seems not to work:

ValueError: Cannot convert NUMERICAL column 'feat_1' of type numpy's array of 'object' and with content=array([array([  0,   2,  10,   9,  12,  10,   9,  20,  35,  51,   0,   0,   0,
                0,   0,   3,   7,  13,  12,  36,   0,   0,   0,   4,   0,   6,
               10,  11,  36,  63,   0,   0,   1,   0,   3,   8,  27,  34,  60,
               93,   0,   0,   0,   0,   0,   3,   8,   9,  25,  18,   0,   0,
                0,   0,   0,   4,   2,  13,  11,  15,   0,   3,  30, 179, 159,
              102,  87,  85,  60,  68])                                       ,
       array([ 0,  1,  0,  2,  1,  2,  4,  9, 18, 34,  0,  1,  0,  1,  1,  5,  8,
              30, 44, 67,  0,  0,  0,  1,  0,  2, 13, 26, 33, 63,  0,  0,  0,  0,
               0,  0,  3, 13, 21, 27,  0,  0,  0,  0,  1,  1,  4,  6, 11, 25,  0,
               0,  1,  2,  4,  1,  9, 20, 40, 41,  1,  1,  2,  6,  4, 10, 19, 25,
              18, 47])                                                  
[The error message prints the whole array, which is way too long. I omit the contents in the middle here.]
      array([  136,  2545, 11700, 18486, 15007, 10840,  7356,  5265,  5448,
               5890,    84,   140,   119,   156,   260,   646,  1778,  2549,
               2890,  2992,     0,     1,     3,     8,    17,    20,    72,
                 91,   151,   179,     0,     0,     1,     4,     5,    12,
                 16,    18,    68,    98,     2,     1,     0,     0,     7,
                 16,    15,    21,    46,    94,     0,     0,     2,    11,
                 12,    23,    16,    34,    67,   117,     4,    10,    25,
                 34,    31,    30,    43,    66,    97,   180])             ],
      dtype=object) to np.float32 values.
Note: If the column is a label, make sure the training task is compatible. For example, you cannot train a regression model (task=ydf.Task.REGRESSION) on a string column.

In this case, I am not sure how to deal with it.
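
For readers hitting the same error, the second traceback suggests the column reaches the conversion step as a 1-D NumPy array of dtype=object whose elements are themselves arrays. The sketch below only illustrates that NumPy behavior with toy values; it makes no claim about YDF's internals. The variable names (`column`, `matrix`) are illustrative.

```python
import numpy as np

# Mimic a dataframe column whose cells each hold an array:
# a 1-D array of dtype=object, one inner array per row (toy values).
column = np.empty(3, dtype=object)
for i, row in enumerate([[0, 2, 10], [0, 1, 0], [0, 0, 0]]):
    column[i] = np.array(row)

print(column.shape, column.dtype)  # (3,) object

# A direct cast of the object array to float32 is expected to fail,
# because each element is a sequence rather than a scalar.
try:
    column.astype(np.float32)
    print("direct cast succeeded")
except (ValueError, TypeError) as err:
    print("direct cast failed:", err)

# Stacking the per-row arrays first yields a regular 2-D matrix
# of shape (n_rows, n_dims) that casts cleanly.
matrix = np.stack(list(column)).astype(np.float32)
print(matrix.shape, matrix.dtype)
```

The stacking step only works when every row holds a sequence of the same length.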

@rstz
Collaborator

rstz commented Nov 11, 2024

In the current implementation, YDF does not support unpacking pandas dataframes to multi-dimensional features, but I'll take this as a feature request to add this functionality. In the first case you're describing, YDF actually creates categorical set features for each column, which is probably not what you want.

If you want to unpack multi-dimensional features, feed the dataset as, e.g., a dictionary of NumPy arrays, as shown in this tutorial.
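
A minimal sketch of the dictionary-of-arrays approach, assuming every row of a multi-dimensional column holds a sequence of the same length. The helpers `columns_to_numpy_dict` and `train_random_forest` are hypothetical names introduced here, and the toy values stand in for the real data. The learner call mirrors the one in the question and assumes the `ydf` package is installed.

```python
import numpy as np


def columns_to_numpy_dict(table, array_columns, scalar_columns):
    """Build a {column name: np.ndarray} dataset (hypothetical helper).

    `table` is anything indexable by column name whose columns iterate
    over rows, e.g. a pandas DataFrame or a plain dict of lists.
    """
    dataset = {}
    for name in array_columns:
        # Stack per-row sequences into one (n_rows, n_dims) float32 matrix.
        dataset[name] = np.stack(
            [np.asarray(row, dtype=np.float32) for row in table[name]]
        )
    for name in scalar_columns:
        # Scalar columns become plain 1-D float32 arrays.
        dataset[name] = np.asarray(list(table[name]), dtype=np.float32)
    return dataset


def train_random_forest(dataset):
    # Assumes `ydf` is installed; imported lazily so the conversion
    # helper above can be used and tested without it.
    import ydf

    learner = ydf.RandomForestLearner(label="target", task=ydf.Task.REGRESSION)
    return learner.train(dataset)


# Toy stand-in for df_train (values are made up for illustration).
toy = {
    "target": [0.05, -0.11, 0.06],
    "feat_1": [[0, 2, 10], [0, 1, 0], [0, 0, 0]],
    "feat_2": [[1.5, 0.5, -0.5], [2.5, 3.5, 2.5], [-3.5, -3.0, -2.5]],
}
dataset = columns_to_numpy_dict(toy, ["feat_1", "feat_2"], ["target"])
print({name: values.shape for name, values in dataset.items()})
```

With the dataframe from the question, the call would be `columns_to_numpy_dict(df_train, ["feat_1", "feat_2"], ["target"])`, followed by `train_random_forest(dataset)`. The county/year index is not carried over by this helper.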

Improving the feature handling is a goal for the next version.

@NashShuai
Author

NashShuai commented Nov 11, 2024

Thank you for your kind reply! I will try reformatting the features as NumPy arrays for the model. In the meantime, it might be helpful to add a note to that tutorial explaining this limitation, so that other users can better understand how the library handles multi-dimensional features. Thank you again for your help!
