expected fields errors when reading full biosample tsv with pandas #52

Open
realmarcin opened this issue May 27, 2021 · 1 comment

Comments

@realmarcin
Collaborator

realmarcin commented May 27, 2021

When reading the full biosample tsv table with pandas like this:

df_biosample = pd.read_csv("harmonized-table.tsv", sep="\t")

This error pops up deep into the file:

ParserError: Error tokenizing data. C error: Expected 464 fields in line 4929258, saw 542

I've checked the offending line and its neighbors using awk to count tabs and they all have 463 tabs (hence 464 fields). I also looked through the fields in those lines and didn't see any odd characters, just the usual strings, identifiers separated by | and dates.
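The awk tab count can also be reproduced in Python. This is a minimal sketch, assuming the file is plain tab-separated text; `field_count_outliers` is a hypothetical helper name, not something from this repo. Besides lines whose field count differs from the expected one, it flags lines containing a double-quote character, because a raw tab count ignores quoting while `pd.read_csv` by default treats `"` as a quote character.

```python
import io


def field_count_outliers(lines, expected, sep="\t"):
    """Yield (line_number, n_fields, has_quote) for lines worth a closer look.

    line_number is 1-based and counts the header line.
    """
    for lineno, line in enumerate(lines, start=1):
        # Count fields the same way the awk check does: separators + 1.
        n_fields = line.rstrip("\r\n").count(sep) + 1
        has_quote = '"' in line
        if n_fields != expected or has_quote:
            yield lineno, n_fields, has_quote


# Tiny in-memory demo (made-up data, 3 expected fields).
sample = io.StringIO('a\tb\tc\n1\t2\t3\n4\t"5\t6\n7\t8\n')
outliers = list(field_count_outliers(sample, expected=3))
print(outliers)
```

On the real table this would be called with an open file handle for harmonized-table.tsv and `expected=464`.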

FWIW, the same error occurs when using skiprows=2, which is a suggested workaround for problematic headers (that shouldn't be the case here anyway).

It's a bit of a puzzle.
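Nothing in this thread confirms the cause. One hypothesis worth ruling out is quoting: with default settings, pandas treats `"` as a quote character, so an unbalanced `"` somewhere in the file can make the parser swallow tabs and newlines until the next `"`, which shifts field counts on rows that look fine under a plain tab count. The sketch below uses made-up data to compare the default read with `quoting=csv.QUOTE_NONE`, which tells the parser to treat quotes as ordinary characters.

```python
import csv
import io

import pandas as pd

# Made-up TSV: two data rows, with an opening quote on one row
# and the matching closing quote on the next.
tsv = 'a\tb\tc\n1\t"x\t3\n4\ty"\t6\n'

# Default quoting: the quoted span may absorb the tab and newline inside it.
df_default = pd.read_csv(io.StringIO(tsv), sep="\t")

# QUOTE_NONE: every tab and newline is taken literally.
df_raw = pd.read_csv(io.StringIO(tsv), sep="\t", quoting=csv.QUOTE_NONE)

print(df_default.shape, df_raw.shape)
```

If the two shapes differ on the real file too, quoting is a likely suspect. If they match, this hypothesis can be dropped.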

@realmarcin
Collaborator Author

Here is a snippet of rows. The offending row should be data row 2, 3, or 4, depending on how the pandas code counts lines (with or without the header, 0- or 1-based offset).

When I read this snippet with the same pandas code there is no error! And the resulting dataframe is as expected with 4 rows and 464 columns.
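Since the line offset is uncertain, one way to build such a snippet is to keep the header and a window of physical lines around the reported number. This is a minimal sketch; `extract_lines` is a hypothetical helper, and line numbers here are 1-based and include the header. Note the assumption built into any such snippet: if whatever triggers the error sits on an earlier line outside the window, the extracted rows will read cleanly on their own.

```python
import io
from itertools import islice


def extract_lines(handle, first, last, keep_header=True):
    """Return the header (optional) plus physical lines first..last.

    Line numbers are 1-based and inclusive; line 1 is the header.
    """
    out = []
    consumed = 0
    if keep_header:
        header = next(handle, None)
        if header is not None:
            out.append(header)
        consumed = 1
    # Never re-read lines that were already consumed as the header.
    start = max(first, consumed + 1)
    stop = max(last - consumed, 0)
    out.extend(islice(handle, start - consumed - 1, stop))
    return out


# Tiny in-memory demo: header plus lines 3 and 4.
demo = io.StringIO("h\nr2\nr3\nr4\nr5\n")
window = extract_lines(demo, first=3, last=4)
print(window)
```

For the reported error, a window such as `first=4929256, last=4929260` on harmonized-table.tsv would cover either counting convention.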

harmonized-table_test.txt

