Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain comments in AST #308

Open
12 tasks
turbolent opened this issue Aug 10, 2020 · 6 comments · May be fixed by #3509
Open
12 tasks

Retain comments in AST #308

turbolent opened this issue Aug 10, 2020 · 6 comments · May be fixed by #3509
Labels
Feature Good First Issue Good for newcomers

Comments

@turbolent
Copy link
Member

turbolent commented Aug 10, 2020

Issue To Be Solved

Currently the AST only retains comments for declarations if they are docstrings (/// line comments, a and /** block comments).

To be able to use the AST for formatting, it must retain also non-docstring comments.

Retain comments for:

  • Declarations
  • Statements
  • Expressions

Definition of Done

  • Add AST elements for comments:
    • Block comment
    • Line comment
  • Add a leading/trailing comments property
    • Declarations: Line and block comments
    • Statements: Line and block comments
    • Expressions: Block comments
  • Add comments to AST elements when parsing
    • Declarations
    • Statements
    • Expressions
  • Tests
@SupunS
Copy link
Member

SupunS commented Mar 9, 2021

I think we may have to capture the comments at a further granular level. Because a comment can exist at any place where whitespace can occur, not only at declarations/statements/expressions level, and we would have to retain those as well.
e.g. let /*Im a comment*/ x: Int

A solution would be to:

  • Capture doc comments only at the declaration level. (It is useful to have them separately)
  • Capture other comments (and any trivia including whitespace, newlines, etc, if needed) at the token level.

Roslyn (C#) compiler follows a similar approach: https://github.com/dotnet/roslyn/blob/main/docs/wiki/Roslyn-Overview.md#syntax-trivia

This way, it also divides the whitespace/comments between two consecutive non-trivia tokens as leading and trailing trivia (rather than having leading trivia only, or trailing trivia only), so that the comments are attached to the correct node/token. And moving that node around (code organizing/formatting), would also move the comment along with it.

e.g: See the bellow comments, which is the 'usual' way followed by most people.

// (1) The comment immediately before the decl is usually about 'x' decl
let x: Int  // (2) A comment on the same line also talks about 'x' decl (or about `Int`)
// (3) Similarly, this is a comment about 'y' decl
let y: Int  // (4) This is also a comment about 'y' decl

In the above, comments (1) and (2) would be a part of x-var-decl. Moving let x: 5 around would also move the comments (1) and (2) along with it.
Similarly, (3) and (4) would be a part of y-var-decl.

@lkadian
Copy link
Contributor

lkadian commented Mar 10, 2021

hi @SupunS Great comment, agree with your proposed solution.
Have you started work on this issue? I ask this because I had actually worked on this but never got around to completing it. I was planning to put up a PR soon, so would just like to confirm if you are also working on it so that we can avoid duplicate work!

@SupunS
Copy link
Member

SupunS commented Mar 10, 2021

@lkadian I haven't done or started any work on this. Great to hear that you have already been working on it! 👏
Would love to have a PR!

Looping in @MaxStalker.

@lkadian
Copy link
Contributor

lkadian commented Mar 29, 2021

This issue has been a bit trickier than I thought as it touches a lot of parts of the parsing code, and as a result it affects a lot of the tests. Posting some info on how I am implementing this, so that if anything seems completely off, someone can let me know! Also to mention that it's still in progress

The original issue mentions retaining trivia for declarations, statement and expressions, but as @SupunS pointed out it looks like we may need to capture leading/trailing trivia for each token. Basically, we'll have a full syntax tree where we hold every token, and each token will know about its leading/trailing trivia.

  • Restructure lexer so that tokens and trivia are separate concepts
  • Fix broken lexer tests
  • Add lexertests

Currently we hold trivia as regular tokens, but we need these to be a separate entity. Then when scanning, we will emit each token accompanied with it's leading and trailing trivia:

type Token struct {
	Type  TokenType
	Value interface{}
	ast.Range
	LeadingTrivia  []Trivia
	TrailingTrivia []Trivia
}

Rules for leading/trailing trivia (taken from Swift):

  1. A token owns all of its trailing trivia up to, but not including, the next newline character.
  2. Looking backward in the text, a token owns all of the leading trivia up to and including the first contiguous sequence of newlines characters
  • Restructure parser to work with above changes and support all tokens in the AST (full syntax tree)
  • Fix broken parser tests
  • Add parser tests
    Currently not all tokens are stored in the AST as they are not needed in later stages. But if we want this sort of "full syntax tree", we will need to keep all tokens for each syntax node.

@SupunS
Copy link
Member

SupunS commented Apr 13, 2021

Thanks, @lkadian for taking this forward and for the great progress!
Sorry for not being in touch recently - I somehow miss updates from this thread, probably have drowned with other stuff.

Currently we hold trivia as regular tokens, but we need these to be a separate entity. Then when scanning, we will emit each token accompanied with it's leading and trailing trivia:

type Token struct {
	Type  TokenType
	Value interface{}
	ast.Range
	LeadingTrivia  []Trivia
	TrailingTrivia []Trivia
}

Rules for leading/trailing trivia (taken from Swift):

  1. A token owns all of its trailing trivia up to, but not including, the next newline character.
  2. Looking backward in the text, a token owns all of the leading trivia up to and including the first contiguous sequence of newlines characters

This sounds like a pretty good solution to me! 👏

  • Restructure parser to work with above changes and support all tokens in the AST (full syntax tree)
  • Fix broken parser tests
  • Add parser tests
    Currently not all tokens are stored in the AST as they are not needed in later stages. But if we want this sort of "full syntax tree", we will need to keep all tokens for each syntax node.

That's a good point. I think we'll have to retain the full syntax tree, if we need to retain all the comments.

In the future, we could add an option to switch capturing the full syntax-tree on and off, depending on the use-case. For example, we need the full syntax tree and the trivia, when using in the IDE/playground during the development time, but don't really need them when parsing and running the code on-chain.

Maybe we can keep a flag for now, but have it 'on' by default, and later we can make it configurable. I'm not sure how big would that change (adding a flag) be. Another option is to add the flag altogether as a separate change, so that current changes don't get too big.

@j1010001
Copy link
Member

j1010001 commented Jan 30, 2023

We can split this into 2 milestones:

  1. Cover the majority of cases – we don't need to have perfect solution to cover every edge-case, good benchmark is to preserve all comments in contracts on Mainnet.
  2. Cover the remaining edge-cases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Good First Issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants