expected fields errors when reading full biosample tsv with pandas #52

realmarcin · 2021-05-27T22:13:48Z

When reading the full biosample tsv table with pandas like this:

df_biosample = pd.read_csv("harmonized-table.tsv", sep="\t")

This error pops up deep into the file:

ParserError: Error tokenizing data. C error: Expected 464 fields in line 4929258, saw 542

I've checked the offending line and its neighbors using awk to count tabs, and they all have 463 tabs (hence 464 fields). I also looked through the fields in those lines and didn't see any odd characters, just the usual strings, identifiers separated by |, and dates.
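For reference, the same tab-count check can be sketched in Python so it runs over the whole file rather than a few lines. This is only an illustration: the helper name `field_count_histogram` is made up here, and it splits on raw tabs exactly like the awk check, so it deliberately ignores any quoting rules that pandas applies.

```python
from collections import Counter


def field_count_histogram(lines, sep="\t"):
    """Tally how many lines contain each number of sep-delimited fields."""
    counts = Counter()
    for line in lines:
        # Drop the line terminator, then count separators; fields = tabs + 1.
        counts[line.rstrip("\r\n").count(sep) + 1] += 1
    return counts


# Example use on the real file (path taken from the report above):
# with open("harmonized-table.tsv", encoding="utf-8", newline="\n") as handle:
#     print(field_count_histogram(handle))
```

If every line lands in the 464 bucket, the raw file is rectangular and the mismatch comes from how the parser interprets the content, not from the tab layout itself.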

FWIW the same error occurs when using skiprows=2, which is a suggested workaround for problematic headers (that shouldn't be the case here anyway).

It's a bit of a puzzle.

realmarcin · 2021-05-27T22:20:02Z

Here is a snippet of rows, the offending row should be data row 2 or 3 or 4, depending on how the pandas code counts lines (w/ or w/o header, 0 or 1 offset).

When I read this snippet with the same pandas code there is no error! And the resulting dataframe is as expected with 4 rows and 464 columns.

harmonized-table_test.txt
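One thing that might be worth ruling out, and this is only a guess and not a confirmed cause: by default `pd.read_csv` treats a double quote at the start of a field as opening a quoted value, which can span tabs and newlines until the next quote. An unbalanced quote elsewhere in the file could then make the parser see more fields than a raw tab count shows, while a short snippet still reads cleanly. The sketch below turns quote handling off with `quoting=csv.QUOTE_NONE`. The helper name and the sample data are hypothetical and not taken from the biosample table.

```python
import csv
import io

import pandas as pd


def read_tsv_literal_quotes(source):
    """Read a TSV while treating double quotes as ordinary characters."""
    # quoting=csv.QUOTE_NONE disables quote handling, so a stray '"' cannot
    # make the parser join several physical lines into one record.
    return pd.read_csv(source, sep="\t", quoting=csv.QUOTE_NONE)


# Hypothetical three-column sample with unbalanced quotes (not the real data).
sample = 'id\tname\tnote\n1\tfoo\t"open\n2\t"bar\tplain\n'
df_sample = read_tsv_literal_quotes(io.StringIO(sample))
print(df_sample.shape)
```

If the full table loads this way, the quote characters stay in the values as literal text and would need to be handled downstream.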
