calibrate() doesn't work if the corpus is just in 2 files #4

samvelkoch · 2023-11-26T20:16:38Z

If train_corpus consists of only 2 files (author1_-_title.txt, author2_-_title.txt), then calibrate(train_corpus) raises an error:

calibrate(train_corpus)

lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-cc70d17a9b30> in <cell line: 1>()
----> 1 calibrate(train_corpus)

5 frames
/usr/local/lib/python3.10/dist-packages/faststylometry/probability.py in calibrate(corpus, model)
     77     ground_truths, delta_values = get_calibration_curve(corpus)
     78 
---> 79     model.fit(np.reshape(delta_values, (-1, 1)), ground_truths)
     80 
     81     corpus.probability_model = model

/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py in fit(self, X, y, sample_weight)
   1194             _dtype = [np.float64, np.float32]
   1195 
-> 1196         X, y = self._validate_data(
   1197             X,
   1198             y,

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    582                 y = check_array(y, input_name="y", **check_y_params)
    583             else:
--> 584                 X, y = check_X_y(X, y, **check_params)
    585             out = X, y
    586 

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1104         )
   1105 
-> 1106     X = check_array(
   1107         X,
   1108         accept_sparse=accept_sparse,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    919 
    920         if force_all_finite:
--> 921             _assert_all_finite(
    922                 array,
    923                 input_name=input_name,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    159                 "#estimators-that-handle-nan-values"
    160             )
--> 161         raise ValueError(msg_err)
    162 
    163 

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
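
The "Degrees of freedom <= 0 for slice" warnings above look like what NumPy emits when a sample standard deviation (ddof=1) is taken over a single row. Below is a minimal sketch of that behaviour in isolation. It is a guess at the mechanism, not a reading of the faststylometry source: the helper name `column_spread` and the toy frequency rows are made up for illustration, and the assumption is that calibration ends up computing a spread over only one text when the corpus has two files.

```python
import warnings

import numpy as np


def column_spread(rows):
    """Per-column sample standard deviation (ddof=1) of a 2-D array of rows.

    Hypothetical helper for illustration only; not part of faststylometry.
    """
    data = np.asarray(rows, dtype=float)
    with warnings.catch_warnings():
        # Silence the same RuntimeWarnings that appear in the traceback above.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.std(data, axis=0, ddof=1)


# Toy word-frequency rows (made-up numbers), one row per text.
one_text = [[0.10, 0.20, 0.30]]
two_texts = [[0.10, 0.20, 0.30], [0.30, 0.20, 0.10]]

spread_one = column_spread(one_text)    # n - ddof == 0, so every column is NaN
spread_two = column_spread(two_texts)   # n - ddof == 1, so every column is finite

print("one text :", spread_one)
print("two texts:", spread_two)
```

If that is indeed what happens, any value divided by such a spread is also NaN, which would match the final "Input X contains NaN" raised by LogisticRegression.fit.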
