How can I override the feature semantic for multi-dimensional features? #145

NashShuai · 2024-11-09T20:52:12Z

I am trying to train a random forest regression model on my dataset. My dataframe looks like this (only the first 5 rows are shown):

              target                                          feat_1  \
county  year                                                                 
County_1 2000   0.047879  [0, 2, 10, 9, 12, 10, 9, 20, 35, 51, 0, 0, 0, ...   
        2001  -0.112184  [0, 1, 0, 2, 1, 2, 4, 9, 18, 34, 0, 1, 0, 1, 1...   
        2002   0.060659  [0, 0, 0, 0, 3, 24, 33, 32, 42, 58, 0, 0, 0, 2...   
        2003   0.098047  [0, 0, 1, 5, 13, 22, 40, 38, 29, 42, 0, 0, 0, ...   
        2004  -0.053559  [0, 1, 0, 2, 6, 8, 14, 33, 34, 64, 0, 0, 1, 1,...   

                                                      feat_2  
county  year                                                     
County_1 2000  [1.8121698113207556, 0.938584905660378, -0.568...  
        2001  [2.6941509433962274, 3.888301886792455, 2.8169...  
        2002  [-3.4043396226415084, -3.458113207547169, -3.5...  
        2003  [-1.9566037735849044, -2.3393396226415084, -2....  
        2004  [-3.2046226415094323, -3.502075471698112, -2.9...

When I run the regression using only feat_1, the code works fine.

FEATURES_IN = ['feat_1']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

[This works well; I can call model.describe() afterwards to inspect the model.]

However, when I include feat_2 in the regression,

FEATURES_IN = ['feat_1', 'feat_2']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

It raises the error:

ValueError: Cannot import column 'feat_2' with semantic=Semantic.CATEGORICAL_SET as it contains floating point values.
Note: If the column is a label, make sure the correct task is selected. For example, you cannot train a classification model (task=ydf.Task.CLASSIFICATION) with floating point labels.

I am not sure how to override the feature semantic for multi-dimensional features, and I could not find this in the documentation. I tried:

FEATURES_IN = [
    ydf.Feature("feat_1", ydf.Semantic.NUMERICAL),
    ydf.Feature("feat_2", ydf.Semantic.NUMERICAL),
]
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

but that does not work either:

ValueError: Cannot convert NUMERICAL column 'feat_1' of type numpy's array of 'object' and with content=array([array([  0,   2,  10,   9,  12,  10,   9,  20,  35,  51,   0,   0,   0,
                0,   0,   3,   7,  13,  12,  36,   0,   0,   0,   4,   0,   6,
               10,  11,  36,  63,   0,   0,   1,   0,   3,   8,  27,  34,  60,
               93,   0,   0,   0,   0,   0,   3,   8,   9,  25,  18,   0,   0,
                0,   0,   0,   4,   2,  13,  11,  15,   0,   3,  30, 179, 159,
              102,  87,  85,  60,  68])                                       ,
       array([ 0,  1,  0,  2,  1,  2,  4,  9, 18, 34,  0,  1,  0,  1,  1,  5,  8,
              30, 44, 67,  0,  0,  0,  1,  0,  2, 13, 26, 33, 63,  0,  0,  0,  0,
               0,  0,  3, 13, 21, 27,  0,  0,  0,  0,  1,  1,  4,  6, 11, 25,  0,
               0,  1,  2,  4,  1,  9, 20, 40, 41,  1,  1,  2,  6,  4, 10, 19, 25,
              18, 47])                                                  
[The error message prints the whole array, which is far too long, so I have omitted the middle part here.]
      array([  136,  2545, 11700, 18486, 15007, 10840,  7356,  5265,  5448,
               5890,    84,   140,   119,   156,   260,   646,  1778,  2549,
               2890,  2992,     0,     1,     3,     8,    17,    20,    72,
                 91,   151,   179,     0,     0,     1,     4,     5,    12,
                 16,    18,    68,    98,     2,     1,     0,     0,     7,
                 16,    15,    21,    46,    94,     0,     0,     2,    11,
                 12,    23,    16,    34,    67,   117,     4,    10,    25,
                 34,    31,    30,    43,    66,    97,   180])             ],
      dtype=object) to np.float32 values.
Note: If the column is a label, make sure the training task is compatible. For example, you cannot train a regression model (task=ydf.Task.REGRESSION) on a string column.

I am not sure how to proceed from here.

rstz · 2024-11-11T07:04:45Z

In the current implementation, YDF does not support unpacking list-valued pandas dataframe columns into multi-dimensional features, but I'll take this as a feature request to add this functionality. In the first case you describe (feat_1 only), YDF actually treats the column as a categorical set feature, which is probably not what you want.

If you want to use multi-dimensional features, feed the dataset in another form, for example as a dictionary of NumPy arrays, as shown in this tutorial.

Improving the feature handling is a goal for the next version.
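
As a rough illustration of the dictionary-of-NumPy-arrays suggestion above, here is a minimal sketch. The helper names to_numpy_dict and train_random_forest and the small example dataframe are hypothetical, not part of YDF. The sketch assumes every row of a list-valued column holds a vector of the same length, and that passing a 2D array per feature is what the tutorial means by a multi-dimensional feature.

```python
import numpy as np
import pandas as pd


def to_numpy_dict(df: pd.DataFrame, vector_columns: list[str]) -> dict[str, np.ndarray]:
    """Convert list-valued columns to 2D float32 arrays; keep other columns 1D.

    Hypothetical helper, not a YDF function. The dataframe index is ignored.
    """
    data = {}
    for name in df.columns:
        if name in vector_columns:
            # Fails with ValueError if the rows hold vectors of different lengths.
            data[name] = np.asarray(df[name].tolist(), dtype=np.float32)
        else:
            data[name] = df[name].to_numpy()
    return data


def train_random_forest(df_train: pd.DataFrame):
    """Train on the converted dataset (assumes YDF accepts a dict of arrays)."""
    import ydf  # imported lazily so the conversion helper works without YDF

    dataset = to_numpy_dict(df_train, vector_columns=["feat_1", "feat_2"])
    learner = ydf.RandomForestLearner(label="target", task=ydf.Task.REGRESSION)
    return learner.train(dataset)


# Tiny stand-in for the dataframe in the question (made-up values, 3 rows).
df_example = pd.DataFrame(
    {
        "target": [0.05, -0.11, 0.06],
        "feat_1": [[0, 2, 10], [0, 1, 0], [0, 0, 0]],
        "feat_2": [[1.8, 0.9, -0.5], [2.7, 3.9, 2.8], [-3.4, -3.5, -3.5]],
    }
)
arrays = to_numpy_dict(df_example, ["feat_1", "feat_2"])
print({name: values.shape for name, values in arrays.items()})
```

With the real data, calling train_random_forest(df_train) would replace the learner.train(df_train) call from the question. The features argument is left out here so that YDF infers the columns from the dictionary keys.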

NashShuai · 2024-11-11T07:13:26Z

Thank you for your kind reply! I will try reformatting the features as NumPy arrays. In the meantime, it might help to add a note about this to that tutorial so that other users can better understand how the library handles such columns. Thank you again for your help!
