
Break in compatibility with jupyterlab-lsp + python-lsp-ruff since 0.0.285 #6847

Closed
felix-cw opened this issue Aug 24, 2023 · 24 comments
Labels
cli Related to the command-line interface

Comments

@felix-cw
Contributor

I'm not 100% sure this is the right project to raise this in, since the problem comes from the interaction between python-lsp-ruff and jupyterlab-lsp. I only raised it here because the issue arose from a change in Ruff.

Since 0.0.285, the jupyterlab-lsp plugin gives the following diagnostic message:

SyntaxError: Expected a Jupyter Notebook (.ipynb), which must be internally stored as JSON, but found a Python source file: expected value at line 1 column 1

I believe this is because jupyterlab-lsp extracts the Python code from the notebook cells, then python-lsp-ruff passes it to Ruff via stdin and specifies --stdin-filename. Since #6628, this makes Ruff expect ipynb JSON rather than Python code, which causes the error.
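To make the failure mode concrete: the concatenated cell text is plain Python, so any JSON parser rejects it at the very first byte. Here is a minimal illustration using Python's json module (Ruff itself uses serde_json, which phrases the equivalent error as "expected value at line 1 column 1"):

```python
import json

# The concatenated cell text that jupyterlab-lsp sends is plain Python source.
cell_source = "import math\nprint(math.pi)\n"

# Any JSON parser rejects it at the first byte: 'i' cannot start a JSON value.
try:
    json.loads(cell_source)
except json.JSONDecodeError as exc:
    # Python's json module phrases this as "Expecting value: line 1 column 1";
    # serde_json (used by Ruff) says "expected value at line 1 column 1".
    print(exc.msg, exc.lineno, exc.colno)
```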

@dhruvmanila
Member

Hey, thanks for opening the issue!

First of all, your observation is spot on: --stdin-filename is used by python-lsp-ruff [1], and JupyterLab is probably sending the source code as Python (instead of JSON) to the server [2] (they have a concept of a Virtual Document). Here, the server is python-lsp-ruff, which invokes the Ruff executable.

You can see the URI for the document which has an extension of .ipynb:

[Screenshot: the document URI, ending in .ipynb]

A few solutions which come to my mind:

  • python-lsp-ruff could always update the document URI to use the .py extension, but that would mean any per-file-ignores using the .ipynb extension won't work
  • A new CLI flag to specify the source type explicitly (similar to the --parser flag in Prettier):
     --parser <flow|babel|babel-flow|babel-ts|typescript|acorn|espree|meriyah|css|less|scss|json|json5|json-stringify|graphql|markdown|mdx|vue|yaml|glimmer|html|angular>
                              Which parser to use.
    

Feel free to share any other suggestions :)

Footnotes

  1. https://github.com/python-lsp/python-lsp-ruff/blob/f6707dfab9c39cbab23b8bc89646341164435afc/pylsp_ruff/plugin.py#L503-L505

  2. https://jupyterlab-lsp.readthedocs.io/en/latest/Architecture.html#positioning-system

@charliermarsh charliermarsh added cli Related to the command-line interface needs-decision Awaiting a decision from a maintainer labels Aug 25, 2023
@felix-cw
Contributor Author

felix-cw commented Oct 3, 2023

Hi all, sorry for posting on an old issue. I would prefer option 2, a new CLI flag, so that options such as per-file-ignores can work both in a Jupyter LSP context and when running Ruff on notebooks on the command line, in pre-commit, etc.

An example of a per-file-ignore that can definitely be scoped to ipynb files is B018.

@zanieb
Member

zanieb commented Oct 3, 2023

Could our notebook parser (or a step upstream of it) just be updated to wrap the content in a single cell if it's not already? I don't see a parser option as a user-facing setting.

@felix-cw
Contributor Author

felix-cw commented Oct 3, 2023

I think that could work. My concern was that since Ruff's output includes cell numbers, it would interfere with the part of jupyterlab-lsp which maps line numbers in the linter output back to the original cells.
However, it looks like with --format=json it just gives line numbers, so it should be fine.

@charliermarsh
Member

I think the exclusion of cell numbers from the JSON format is a bug (see, e.g., #7664). Broadly, I'm worried that treating Python source as Jupyter could cause other problems, some unanticipated (e.g., if you accepted the code via stdin, we'd then output notebook JSON when formatting or autofixing).

Without doing deeper research, my current preferences would be something like:

  1. Figure out how to get jupyterlab-lsp to send us the full notebook. This would be the most reliable option, and it would avoid unnecessarily converting from Jupyter to Python and back to Jupyter.
  2. Add a dedicated flag to determine the source type (like --parser). I think this is reasonable, even if only intended for LSPs and other programmatic clients.
  3. If we parse a notebook and it's not JSON, fall back to parsing it as Python? This is somewhat similar to Zanie's suggestion, but internally we'd treat the whole thing as if it were Python rather than trying to convert it to Jupyter. This feels like a structural error, though; I don't love it.
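Option 3 could be sketched roughly as follows. This is a hypothetical helper, not Ruff's actual code; the only grounded detail is that ipynb JSON stores code under cells[*].source with a cell_type field:

```python
import json

def load_source(text: str, extension: str) -> str:
    # Hypothetical fallback sketch (not Ruff's actual code): if a .ipynb
    # payload isn't valid JSON, lint the payload as plain Python instead
    # of failing with a syntax error.
    if extension == "ipynb":
        try:
            nb = json.loads(text)
        except json.JSONDecodeError:
            return text  # fall back: treat the content as Python source
        # Real ipynb JSON stores code under cells[*].source.
        return "".join(
            "".join(cell["source"])
            for cell in nb["cells"]
            if cell["cell_type"] == "code"
        )
    return text

print(load_source("import os\n", "ipynb"))  # falls back to plain Python
```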

@zanieb
Member

zanieb commented Oct 3, 2023

Agree that 1 would be preferable. 2 is fine with me if we hide it from the CLI help menu until there are more user-focused use cases.

@felix-cw
Contributor Author

felix-cw commented Oct 6, 2023

Thanks everyone! I thought it might be too complicated to get python-lsp-ruff/jupyterlab-lsp to use linters that work on notebooks directly, so I made a PR implementing the --parser option. Thanks for the opportunity to contribute to such a useful and amazing project!

@MichaReiser
Member

My concern with 2 is that, compared to Prettier, we don't support different parsers. I'd prefer a CLI flag that is more high-level and instead allows overriding the file type (MIME-type remapping), which can be useful in use cases other than the one mentioned here.

@charliermarsh
Member

Can you spell out in a bit more detail what this CLI flag would look like and/or how it would differ from the above?

@dhruvmanila
Member

overriding the file type (mime type remapping)

Wouldn't that conflict with the file type inferred from the extension of the filename provided via --stdin-filename? Or would that mean the inference is skipped?

FWIW, in ruff-dev I added a --jupyter option, which just means that the content of the Python file comes from a Jupyter Notebook. This helped in development, where I didn't need to run a Jupyter server for testing purposes. I think this is a similar case where we want to tell Ruff that the content of a file comes from a different file type.

@MichaReiser
Member

Wouldn't that be in conflict with the inferred filetype from the filename's extension provided via --stdin-filename? Or, would that mean that the inference will be avoided?

What I have in mind is that you can map extensions to other languages. I don't have a good Python example, but JSON supports different standards, and a user may decide that all their .json files are JSON5. They could then configure *.json = JSON5. I guess it's similar to --parser, except that we define the language rather than the parser (because we don't support different Python parsers).

We could expose the same option on the CLI, where it could be used as --extension ipynb=python or similar.

@charliermarsh
Member

Interesting, so --extension ipynb=python could be repeated?

@zanieb
Member

zanieb commented Oct 20, 2023

I kind of like that API

@krassowski

krassowski commented Oct 20, 2023

Hi, some context from jupyter-lsp side:

  • when jupyter-lsp was created, the LSP protocol supported exclusively plain-text files
  • while the most recent LSP version (3.17) embraces notebooks, it does so in a somewhat convoluted way, sadly not aligned with Jupyter standards; one consequence is that even if jupyterlab-lsp implements support for the notebook messages from 3.17, it will still not pass the true notebook-format JSON but rather a quasi-notebook structure implementing the NotebookDocument interface, which diverges from the Jupyter ipynb standard (e.g. kind instead of cell_type, executionSummary.executionOrder instead of execution_count, etc.), and in which the cell sources are not actually stored but abstracted out to other documents that need to be retrieved separately.
  • there are some pros and cons for jupyter-lsp sending concatenated source rather than pure JSON:
    • (+) we can piece together nested documents having different languages (e.g. ipython cells with %%sql can be hidden from the parent ruff server)
    • (+) we do not need to worry about stripping metadata/outputs, which can weigh hundreds of MB in a notebook
    • (-) we lose structure information and thus the ability to provide some cool hints such as "3 empty cells" or "Code in a markdown cell will not be executed, did you change the cell type by accident?"

The last issue, the loss of structure, could be alleviated if Jupyter ends up formalising a canonical text-based representation of notebooks (currently being discussed from a Markdown-first perspective in jupyter/enhancement-proposals#103), and even more so if it includes a way to preserve cell-level metadata. But this has little impact on the discussion here, beyond perhaps the naming of a future option (ipynb is a very concrete Jupyter standard, but future Jupyter notebook serialization formats may allow other forms, so it may be best to avoid option names like --jupyter, or to make them only an alias for, say, --ipynb).

jupyter-lsp could potentially abuse the LSP protocol and send the JSON dump as if it were plain text. I would welcome PRs (in jupyterlab/jupyterlab-lsp) which explore this idea, but as of now I am not convinced how well it would work or how much work is needed to make it practical. Again, the problem with the idea that Ruff can format notebooks is that if we want quick reply times, we still need to strip the large visualisations from cell outputs and then stitch them back together (the time it takes 100 MB to travel over the wire, especially for remote servers, far outweighs the benefits of Ruff being blazingly fast); if Ruff were to rearrange/remove/add/merge cells when reformatting, this would be a bit hard to achieve (but possible if we just replaced the outputs with dummy tokens for the round trip).

@dhruvmanila
Member

We could expose the same option on the CLI, where it could be used as --extension ipynb=python or similar.

So the mapping will be from file extension to language (or perhaps file extension to file extension, i.e., ipynb=py?). This means one could also do py=ipynb, which would mean that the Python file should contain JSON representing the Notebook. I would prefer to introduce this as a hidden flag for now, until we finalize a general way of providing multi-language support.
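For illustration, a minimal sketch of what such an extension-to-language remapping might look like. The names and the default table here are assumptions for the sake of the example, not Ruff's actual internals:

```python
# Hypothetical sketch of extension-to-language remapping; the names and the
# default table are illustrative, not Ruff's actual internals.
DEFAULT_LANGUAGES = {"py": "python", "pyi": "pyi", "ipynb": "ipynb"}

def resolve_language(filename: str, overrides: dict[str, str]) -> str:
    ext = filename.rsplit(".", 1)[-1]
    # An override such as {"ipynb": "python"} wins over the inferred default,
    # and {"py": "ipynb"} would mean a .py file containing notebook JSON.
    return overrides.get(ext, DEFAULT_LANGUAGES.get(ext, "python"))

print(resolve_language("Untitled.ipynb", {"ipynb": "python"}))  # python
```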


jupyter-lsp could potentially abuse LSP protocol and send the JSON dump as if it was plain text.

Yes, we're doing this in astral-sh/ruff-lsp#264, where the LSP creates JSON text representing the Notebook in VS Code, which is what is sent to the Ruff CLI. This works for us because we're in control of which fields we use from the JSON format, and I agree that it might be difficult to incorporate this in a general way for multiple tools.

@felix-cw
Contributor Author

We could expose the same option on the CLI, where it could be used as --extension ipynb=python or similar.

I've had a go at implementing this idea as a hidden flag. Is it OK if I raise a PR for it and close the previous PR?

@dhruvmanila
Member

I'm not exactly sure this is going to be enough. Ruff keeps track of the cell boundaries, which are used to separate import blocks in each cell. For example, take the following two cells:

Cell 1:

```python
from math import pi
import os
```

Cell 2:

```python
from pathlib import Path
import sys
```

Correct me if I'm wrong, but the concatenated source code Ruff would receive through JupyterLab LSP would be:

```python
from math import pi
import os
from pathlib import Path
import sys
```

The imports would be sorted as:

```python
import os
import sys
from math import pi
from pathlib import Path
```

And then mapped back to the Notebook as:

Cell 1:

```python
import os
import sys
```

Cell 2:

```python
from math import pi
from pathlib import Path
```

Notice that the imports have been moved between cells. Ruff's native implementation wouldn't do that; it only sorts imports within each cell.
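The hazard described above can be demonstrated mechanically: globally sorting the imports in the concatenated source and then splitting at the original cell boundaries moves imports between cells. This toy sketch uses a simplified lexical sort, not Ruff's real isort logic:

```python
# Toy demonstration of the cross-cell import movement described above.
# (Simplified lexical sort standing in for Ruff's richer isort logic.)
cells = [
    ["from math import pi", "import os"],
    ["from pathlib import Path", "import sys"],
]

# Concatenate all cells into one "virtual document", as the LSP layer does.
concatenated = [line for cell in cells for line in cell]

# Sort globally: plain `import` statements first, then `from` imports.
sorted_lines = sorted(concatenated, key=lambda l: (not l.startswith("import"), l))

# Split back using the original cell lengths, mimicking the reverse mapping.
rebuilt, start = [], 0
for cell in cells:
    rebuilt.append(sorted_lines[start:start + len(cell)])
    start += len(cell)

print(rebuilt[0])  # ['import os', 'import sys'] -- `import sys` moved to cell 1
```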

@felix-cw
Contributor Author

That's interesting. Are there any rules other than I001 where the diagnostic can span multiple cells?
As far as I know, jupyterlab-lsp doesn't currently support formatting/fixing, so it can't move import statements between cells.
Is the concern that this will lead to false positives?

@charliermarsh
Member

@felix-cw - I001 is the only rule whose diagnostic can span multiple cells, but there are multiple rules whose fixes can span multiple cells (e.g., you may need to add an import at the top of the file in order to access a new standard-library module for a code edit).

@dhruvmanila
Member

As far as I know, right now jupyterlab-lsp doesn't support formatting/fixing so it can't move import statements between cells.

Oh, interesting! Thanks for letting us know. So it seems they don't support code actions / formatting, which are the two main ways we would edit Notebook content. But they do plan to support them in the future: jupyter-lsp/jupyterlab-lsp#985.

Is the concern that this will lead to a false-positive?

I'm not exactly sure. My main concern is that I don't know how JupyterLab LSP maps location information between the actual Notebook source and the concatenated source. This would probably be updated when they integrate editing cell content via code actions / formatting.

There is documentation of their source-map model (https://github.com/jupyter-lsp/jupyterlab-lsp/blob/main/docs/Architecture.ipynb), which seems to map (line, character) values between multiple types of documents, concatenating the source code with two blank lines between cells.

We could, theoretically, support diagnostics using any of the possible solutions mentioned in this issue, but it would probably break when they add support for code actions.
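As a rough illustration of the kind of mapping such a source map has to perform (a hypothetical helper, assuming cells are joined with two blank separator lines as their architecture docs describe):

```python
# Hypothetical sketch of position mapping between a concatenated document and
# its notebook cells, assuming two blank separator lines between cells.
def build_line_map(cell_line_counts: list[int]) -> dict[int, tuple[int, int]]:
    mapping, global_line = {}, 0
    for cell_idx, count in enumerate(cell_line_counts):
        for local_line in range(count):
            # Map a line in the concatenated source back to (cell, line-in-cell).
            mapping[global_line] = (cell_idx, local_line)
            global_line += 1
        global_line += 2  # skip the two separator blank lines between cells
    return mapping

# Two cells of 2 and 3 lines: global line 4 is the first line of cell 1.
m = build_line_map([2, 3])
print(m[0], m[4])  # (0, 0) (1, 0)
```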

@krassowski

I think you do not need to worry about how jupyterlab-lsp handles the mapping between cells and text content; I am sure this will work well. And yes, there is a branch with code-actions support, but I cannot test it against ruff yet because it just rejects the text content on the basis of its extension ;)

dhruvmanila pushed a commit that referenced this issue Nov 8, 2023
…le extension (#8373)

## Summary

This PR addresses the incompatibility with `jupyterlab-lsp` +
`python-lsp-ruff` arising from the inference of source type from file
extension, raised in #6847.

In particular it follows the suggestion in
#6847 (comment) to
specify a mapping from file extension to source type.

The source types are

- python
- pyi
- ipynb

Usage:

```sh
ruff check --no-cache --stdin-filename Untitled.ipynb --extension ipynb:python
```

Unlike the original suggestion, `:` instead of `=` is used to associate
file extensions with languages, since that is what the existing
`--per-file-ignores` option, which also accepts a mapping, uses.

## Test Plan

2 tests added to `integration_test.rs` to ensure the override works as
expected

---------

Co-authored-by: Charlie Marsh <[email protected]>
@dhruvmanila dhruvmanila removed the needs-decision Awaiting a decision from a maintainer label Nov 8, 2023
@dhruvmanila
Member

Resolved in #8373

@felix-cw Do you mind providing the next steps required to make this integration work? Is it mostly on the user's side? If so, I can update the documentation. Are there any changes required on the jupyterlab-lsp / python-lsp-ruff side?

@felix-cw
Contributor Author

felix-cw commented Nov 8, 2023

Thanks for merging my pull request!

I think we need to add --extension ipynb:python to the call that python-lsp-ruff makes to Ruff. I'm not sure about the side effects when it's used outside of jupyterlab-lsp, though, so it might need to be behind a config option. I'll reach out to the projects.

@dhruvmanila
Member

This is resolved when using v2.0.0 of python-lsp-ruff; big thanks to @felix-cw for driving this change forward 🥳

https://github.com/python-lsp/python-lsp-ruff/releases/tag/v2.0.0
