-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CodeBuffer: implement .subSequence(), plus tests #2
base: master
Are you sure you want to change the base?
Conversation
Thanks for PR, it will be reviewed during next sprint on SSLR. |
Implementation of method "subSequence" can be copy-on-read or wrapping of underlying array, and both have their advantages and disadvantages depending on scenarios of usage. Since currently there is no potential consumers of this feature nor in SSLR itself, nor in main consumers of this library (SonarQube plugins), scenarios of usage are unknown and method left unimplemented as a kind of defensive approach against disadvantages of both implementations. Now |
Well, another possibility would have been to wrap the whole byte[] into a CharBuffer to begin with. A .subsequence() would then have been marginally cheaper. The problem here is basically that |
Not hypothetical at all and defensive principle is exactly to avoid bad surprises. And again - "I want" is not a concrete usage scenario. |
Sorry but those are purely hypothetical. And like I said, another implementation (I can provide it) would just have been to wrap the whole char array. This way a call to .subSequence() would just reuse that instance-wide charbuffer. And I maintain: the principle of least surprise takes priority over anything else. I'll redo the patch with a global |
Done. Note however that this could be vastly improved. In particular, it looks like pop() could be rewritten as a simple get() on the buffer which in turn means the offset calculations could be simplified. If you want me to work on this aspect, I can do that too. Not only that but it can also lead the way to real unicode support, as I do in grappa. |
And I repeat - not hypothetical, but a real issue faced in past. Current implementation is the principle of least surprise for us. Update has exactly the same flaw as previous change. Instead of being aggressively pushy, might be better to start listening to what others say. So once again - how you use subsequence (usage scenario)? Without this another solution to problem of contract - remove inheritance of CharSequence, ticket created - http://jira.sonarsource.com/browse/SSLR-372 |
In grappa, I use what is called an What parboiled did in the past, for no good reason, was to specialize implementations for I was quite happy to see that /**
* An {@link InputBuffer} over a {@link CodeReader}
*
* <p>Unfortunately, this is required. Sonar's {@link CodeBuffer} does not
* support {@link CharSequence#subSequence(int, int)} for no good reason that I
* can see... So this is basically a {@link CharSequenceInputBuffer} with
* subsequence extraction rewritten.</p>
*
*/
@Immutable
public final class CodeReaderInputBuffer
implements InputBuffer
{
private static final ExecutorService EXECUTOR_SERVICE;
static {
final ThreadFactory factory = new ThreadFactoryBuilder()
.setDaemon(true).setNameFormat("linecounter-thread-%d").build();
EXECUTOR_SERVICE = Executors.newCachedThreadPool(factory);
}
private final CodeReader reader;
private final Future<LineCounter> lineCounter;
public CodeReaderInputBuffer(@Nonnull final CodeReader reader)
{
this.reader = Objects.requireNonNull(reader);
lineCounter = EXECUTOR_SERVICE.submit(() -> new LineCounter(reader));
}
@Override
public char charAt(final int index)
{
return index >= 0 && index < reader.length()
? reader.charAt(index) : Chars.EOI;
}
/**
* Returns the Unicode code point starting at a given index
* <p>If the index is greater than, or equal to, the buffer's length, this
* method returns -1.</p>
*
* @param index the index
* @return the code point at this index, or -1 if the end of input has been
* reached
*
* @throws IllegalArgumentException index is negative
*/
@Override
public int codePointAt(final int index)
{
final int length = reader.length();
if (index >= length)
return -1;
if (index < 0)
throw new IllegalArgumentException("index is negative");
final char c = reader.charAt(index);
if (!Character.isHighSurrogate(c))
return c;
if (index == length - 1)
return c;
final char c2 = reader.charAt(index + 1);
return Character.isLowSurrogate(c2) ? Character.toCodePoint(c, c2) : c;
}
@Override
public String extract(final int start, final int end)
{
final int realStart = Math.max(start, 0);
final int realEnd = Math.min(end, reader.length());
final CharBuffer buf = CharBuffer.allocate(realEnd - realStart);
for (int i = realStart; i < realEnd; i++)
buf.put(charAt(i));
return new String(buf.array());
}
@Override
public String extract(final IndexRange range)
{
return extract(range.start, range.end);
}
@Override
public Position getPosition(final int index)
{
/*
* A CodeReader column index starts at 0, not 1; we therefore need to
* substract one from the extracted position...
*/
final Position position
= Futures.getUnchecked(lineCounter).toPosition(index);
return new Position(position.getLine(), position.getColumn() - 1);
}
@Override
public String extractLine(final int lineNumber)
{
Preconditions.checkArgument(lineNumber > 0, "line number is negative");
final LineCounter counter = Futures.getUnchecked(lineCounter);
final Range<Integer> range = counter.getLineRange(lineNumber);
final int start = range.lowerEndpoint();
int end = range.upperEndpoint();
if (charAt(end - 1) == '\n')
end--;
if (charAt(end - 1) == '\r')
end--;
return extract(start, end);
}
/**
* Get the index range matching a given line number
*
* @param lineNumber the line number
* @return the index range
*/
@Override
public IndexRange getLineRange(final int lineNumber)
{
final Range<Integer> range
= Futures.getUnchecked(lineCounter).getLineRange(lineNumber);
return new IndexRange(range.lowerEndpoint(), range.upperEndpoint());
}
@Override
public int getLineCount()
{
return Futures.getUnchecked(lineCounter).getNrLines();
}
@Override
public int length()
{
return reader.length();
}
} One more reason that I fail to see the non implementation of |
As to this:
would you please elaborate? The interface is quite simple to implement. In fact, |
Yes, I can elaborate: consumer of "subSequence" instead creation of copy keeps reference on result as is, and result survives multiple parses of different files, accumulating content of those files in memory. |
Then why don't you just return a I can modify the patch to do that, too... |
Just copy the necessary contents into a new char[] and return it in a String, which implements CharSequence.
Ate my own dog's food, new patch. (oh, and tests still pass) |
Further note about the patch: it fully obeys the contract in that if an index is passed which is out of bounds, or start is greater than end etc, it will return an |
Creation of String - is a copy-on-read, which might be undesirable in scenarios, where expected that method "subSequence" guarantees constant time. That's why I asked to provide usage scenarios, which are supposed to be not just a dump of some code, but an explanations (data, data flows, workloads, performance requirements, ...), which would help to choose solution / implementation approach, instead of trying quickly randomly find "correct" hot patch, whose behavior we'll be forced to support as soon as consumers will start relying on it.
|
It does. I speak from experience here. A I discovered this fact while working on largetext, and tried to print what was actually matched... I didn't implement |
Ok, excuse me, usage of |
OK, fine, then the decision is yours. I believe my patch is pretty simple to be included as is, since the results of it are basically stateless. |
Some more comments...
Let's be realistic here for a moment. Can a copy of some arbitrary amount of data be in constant time, that is O(1)? The answer is obvious: NO. It can never be. The only condition where it can be is that if a reference to the original data is kept, and THAT is the scenario which is undesirable. While my first patch was indeed "constant time" in that regard, and undesirable, the latest patch is NOT constant time (its runtime depends on the length of the requested subsequence). But both patches cure the fact that CodeBuffer, as it is now, breaks the CharSequence contract; with my patch, it obeys it to its fullest extent, bounds checking and all, for no additional cost whatsoever, except for the guys like me who do happen to rely on CharSequence.
Well there you have one scenario. That dump of code as you call it is precisely the critical part in all of my parsin process. And there is no data flow/performance requirements to speak of: just the fact that if an interface is claimed to be implemented, IT SHOULD BE. And my fix for that is simple, obvious and comes at no additional cost. So, what's the problem? |
e1979f9
to
9a641a5
Compare
There is no reason why CodeBuffer isn't able to support it.
Just use a CharBuffer to .wrap() the buffer, and use this CharBuffer's
.subSequence() with the indices adjusted according to the buffer's current
position.