[question] concurrent org.apache.jena.graph.Graph implementations
#1961
Any Graph or Dataset that uses our Now different implementations offer differing levels of concurrent access so the choice of implementation is going to depend on your usage pattern. For example If you can provide more detail on how you use the |
Well, the code is open for both ONT-API and concurrent-rdf-graph. Also, I don't quite understand how transactions could help if we have Streams and don't want to put everything in memory. In concurrent-rdf-graph there are tests. In any case, it seems to me that an additional mechanism will not hurt. |
Thanks for raising this issue. Jena provides:
With transactions, iterators from this graph are consistent - they iterate over the data at the time
Could you expand on that? Why "can't"? There are various problems that arise that ACID transactions address. Protecting the graph data structures is one part of that - having a consistent view of the data is another.
These seem to protect individual operations but not a sequence of operations (the A in ACID), e.g. adding several triples. So the data structures are protected but the application can see half-completed changes. I'm not sure the |
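To illustrate the point about individual operations versus a sequence of operations, here is a minimal sketch using only the JDK. `LockedTripleStore` is a hypothetical stand-in for a lock-protected graph (triples are plain strings); it is not Jena or concurrent-rdf-graph code. The per-operation `add` is safe on its own, but a reader can run between two calls and observe a half-completed batch, whereas `addAll` holds the write lock for the whole batch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a lock-protected graph; triples are plain strings. */
class LockedTripleStore {
    private final Set<String> triples = new HashSet<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Safe on its own, but every call is a separate critical section. */
    public void add(String triple) {
        lock.writeLock().lock();
        try {
            triples.add(triple);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Readers share the read lock and never see a set that is mid-update. */
    public int size() {
        lock.readLock().lock();
        try {
            return triples.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Holds the write lock for the whole batch: readers see none or all of it. */
    public void addAll(List<String> batch) {
        lock.writeLock().lock();
        try {
            triples.addAll(batch);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Records the sizes a reader could observe between per-operation adds. */
    public static List<Integer> sizesSeenBetweenSingleAdds(List<String> batch) {
        LockedTripleStore store = new LockedTripleStore();
        List<Integer> seen = new ArrayList<>();
        for (String t : batch) {
            store.add(t);
            seen.add(store.size()); // another thread's read may land exactly here
        }
        return seen;
    }
}
```

For a three-triple batch, `sizesSeenBetweenSingleAdds` returns `[1, 2, 3]`: the intermediate sizes 1 and 2 are the half-completed states. Note that `addAll` only hides intermediate states from readers; it does not roll back on failure, so it is still weaker than a transaction.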
OWLAPI provides an R/W locking mechanism, so we should also support this functionality in ONTAPI; we already have a concurrent
You can comment out this method and run tests. I think the tests will fail. Method But of course there could be mistakes and possible improvements; this implementation can be considered a draft.
Yes, there is an issue about protecting the RDF Model view: owlcs/ont-api#46 |
From the Jena project point of view, it's not desirable to add implementations that support one or two uses if there is a way to use the general machinery. The project fills up with random unloved code with a maintenance cost. Jena already has It is better than MRSW that
It is thread-safe at the moment. At the moment iterators will see a complete write transaction appear atomically between The only thing I see missing is having a consistent It isn't even at risk of blocking other threads if the iterator is not closed; a write on the same thread will fail at the point of the write. If the application uses explicit transactions itself, it will work as well. Fuseki operations all use transactions. |
I would not compare
I think ONT-API library users can use
Ok, I got the point. If it is the final resolution, then the issue, I think, can be closed. |
The trouble with locks is deadlocks - both absolute ("deadly embrace") and under load ("deadlock by congestion"). Don't mix "simple" and "lightweight" :-) Java locks are not lightweight. Their contract on the memory model is expensive to the whole application. It's the first generation memory model for Java. It's not a great fit to modern processor architectures.
The datastructures behind GraphTxn are a Java port of the structures in Scala. I don't know ONT-API in detail - what is the contract for the application? |
In OWL-API (and ONT-API), there is a concurrent
Maybe "contract" is not quite the right term, sorry for the confusion. The concurrent manager is used, for example, in Protege (btw I have an RDF-based fork of Protege with ONTAPI inside, as an example of ONTAPI use). I'm not sure the concurrent manager in Protege is a correct solution (there are only two threads, main and EDT; it is a Swing application).
I should take a closer look at it. Maybe it would be a new feature in ONTAPI. +
Maybe it is second generation? From [wiki](https://en.wikipedia.org/wiki/Java_memory_model):
|
The discussion here is about general support. If graph implementations are in ONT-API, they can do what is required for that module. |
I see I think, due to this fact it is not very suitable for the library ONT-API, it is intended to wrap any graph, providing concurrent access out-of-box (maybe important, in the ONT-API there are two ways: concurrent and non-concurrent) + I see lack of documentation, nothing is said about the fact that this graph is thread-safe. |
I did a quick research, it shows that in some cases |
GH-1961: Thread-safe and consistent GraphTxn.find
Is it closed intentionally or accidentally (due to commit message)? It is OK for me if Jena doesn't need this thread-safe wrapper (lightweight and simple in my opinion), but it might be better if an explicit final resolution was written. |
Side-effect of the PR. Reopened. |
GraphTxn does more. Because that includes multiple true concurrent views of the data it will cost more. It does not require the Kotlin runtime. This is like "autocommit" for SQL databases. It has all the consequences of that as well. It has more overhead, and does not give application change consistency (e.g. a read-modify-write is two transactions, and other transactions/threads may change or view the database between the "read" and the "write"). A bank account with $1000. To add to the account, an app reads the amount, adds the extra amount, deletes the old value and writes the new value (let's assume there is a modify that does delete-add atomically and not get into the fact Thread A adds $500, thread B adds $250, what is the account now? With just safe read-modify operations, the answer is one of $1250, $1500, and $1750 depending on the way the operations interleave. Any solution to the concurrent read-modify-write needs the application to indicate the start and finish of
|
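The bank-account arithmetic above can be checked with a small sketch that uses only the JDK. It does not use Jena or threads: `LostUpdateDemo` and its `Step` enum are hypothetical names, and the code simply replays every ordering of two unprotected read-modify-write deposits in which each thread reads before it writes.

```java
import java.util.Set;
import java.util.TreeSet;

/** Replays all interleavings of two unprotected read-modify-write deposits. */
class LostUpdateDemo {
    /** Each deposit is a separate "read" step followed by a separate "write" step. */
    private enum Step { A_READ, A_WRITE, B_READ, B_WRITE }

    /** Executes one interleaving and returns the final balance. */
    private static int run(Step[] order, int start, int depositA, int depositB) {
        int balance = start;
        int seenByA = 0;
        int seenByB = 0;
        for (Step s : order) {
            switch (s) {
                case A_READ:  seenByA = balance; break;
                case A_WRITE: balance = seenByA + depositA; break;
                case B_READ:  seenByB = balance; break;
                case B_WRITE: balance = seenByB + depositB; break;
            }
        }
        return balance;
    }

    /** Collects the final balance of every ordering where each read precedes its own write. */
    public static Set<Integer> possibleBalances(int start, int depositA, int depositB) {
        Step[][] interleavings = {
            {Step.A_READ, Step.A_WRITE, Step.B_READ, Step.B_WRITE}, // serial: A then B
            {Step.B_READ, Step.B_WRITE, Step.A_READ, Step.A_WRITE}, // serial: B then A
            {Step.A_READ, Step.B_READ, Step.A_WRITE, Step.B_WRITE}, // A's deposit is lost
            {Step.A_READ, Step.B_READ, Step.B_WRITE, Step.A_WRITE}, // B's deposit is lost
            {Step.B_READ, Step.A_READ, Step.A_WRITE, Step.B_WRITE}, // A's deposit is lost
            {Step.B_READ, Step.A_READ, Step.B_WRITE, Step.A_WRITE}, // B's deposit is lost
        };
        Set<Integer> results = new TreeSet<>();
        for (Step[] order : interleavings) {
            results.add(run(order, start, depositA, depositB));
        }
        return results;
    }
}
```

`LostUpdateDemo.possibleBalances(1000, 500, 250)` yields `[1250, 1500, 1750]`, matching the three outcomes listed above. Only the two serial orderings give $1750, which is why the read and the write have to sit inside one application-declared unit of work.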
I understand your example. Of course, thread-safe Graphs cannot replace transactional ones. I think they have different fields of application. Analogy: It seems to me that the main question in this issue is whether it will be useful to someone or not. If not, then perhaps we shouldn't add this functionality to Apache Jena. There is definitely one example of use: ONTAPI. I'm thinking of other examples, but so far such examples are purely speculative. For example, we could collect a graph from various sources simultaneously and display it as a tree or a set of axioms on various devices without consistency. Perhaps even co-edit (but it seems in this case a transactional graph is more suitable). Maybe we could use a thread-safe graph as a kind of cache. The And of course, a thread-safe graph wrapper has some advantages over a transactional one - performance and the ability to wrap any graph. Maybe it would be useful for someone. I asked a colleague about additional examples; if we come up with something, I will edit this comment. UPD-1: Found one real example from a business project: we collect data from different sources (CockroachDB) in parallel and then write the data to an RDF Graph. Maybe right now the performance of UPD-2: Asked ChatGPT:
|
Although I initially didn't plan to share this, given our ongoing discussion, I believe it's appropriate... I'm currently developing a thread-safe graph for Jena, which supports a multiple-reader plus single-writer (MR+SW) model. The existing implementation in Jena is based on Dexx collections, which are "a port of Scala's immutable, persistent collection classes to pure Java." This approach has numerous advantages, such as simple implementation, minimal locking requirements, and robustness. However, its main drawback is its poor performance and heavy reliance on the garbage collector. My goal is to offer an alternative that should provide better performance. However, as this new model isn't based on immutable persistent collections, it's quite a delicate endeavor. These graphs could form the foundation for an alternative thread-safe Dataset implementation. Therefore, they could potentially be used in Fuseki. Please note that this is a work in progress and may take a few more weeks (or potentially months) to develop. It's still in an early stage and may need several rounds of refactoring. The current state of development suggests that my plan can be successful. Even considering the overhead of locking, thread-local variables, and some duplicate operations, this implementation can significantly outperform existing ones (as always, this will depend on specific use cases). For anyone who can't resist a sneak peek, you can check out the progress at: Please be patient with me if my approach ends up not being successful. I'm doing my best to tackle this, but there's a high risk of failure. |
I wrote benchmarks:
I think the concurrent benchmarks are not fully correct (if somebody knows how to write them correctly, please give me feedback). @arne-bdt please let me know when you finish your implementation; I will include it in these benchmarks. UPD: added |
@sszuev, I would greatly appreciate it if you could include my implementation in your benchmarking tests. My own benchmarks and tests suggest that this implementation of a SWMR+ graph will soon be viable for deployment with our clients. However, it's important to note that testing and documentation are still works in progress, and some refactoring is anticipated. Despite these ongoing developments, the implementation is functioning as intended and meets our speed requirements. In our use case, we manage a graph with approximately 600,000 triples. We face a demanding scenario where 375,000 triples are updated every three seconds, alongside smaller updates involving about 100 triples every second. This occurs concurrently with eight threads, each reading all triples once per second. On my new notebook, which has sufficient processing power, I've been able to simulate this scenario in a single unit test. You can find these tests here: To run these tests, you'll need to check out my SMMR_Dataset branch to get it running. |
@afs @arne-bdt In my tests I can overcome this limitation by wrapping txn-graph with |
Is that stacktrace for the GraphTxn case? I can't see where the "find" step is. Why is the code writing inside a The most robust solution is to isolate the find. It adds cost only if iterators are opened and not fully used (e.g. implementations of "contains" would need checking). |
@sszuev: You would also need to consume the iterator within the transaction, just like you did with the .toList() call. After the transaction is closed, the iterator is no longer valid. I wanted to mention this, because it is not true for GraphTxn, as far as I know. |
Yes. You can find the code here: https://github.com/sszuev/concurrent-rdf-graph The stacktrace above is from a single-thread test in a scenario where read operations are followed by a write. @arne-bdt Let me remind you that I have neither an implementation nor tests of full-fledged ACID. |
@afs |
@afs I did the same thing :-) @sszuev Your benchmarks seem to test for throughput in different scenarios. If I understand the code correctly, there is always one graph to start with and then multiple threads doing almost the same operations (sometimes triples depend on the thread or iteration number) working on that same graph in multiple iterations. Most of them mix read and write operations. In my implementation, the semaphore for the write locks is not fair (maybe it should be). |
The tests are based on various scenarios that are executed with multithreading (java-executors and kotlin-coroutines in tests - it does not matter) and in cycles. There are Benchmarks also contain functional, i.e. classic, benchmarks, where a single operation is measured. @arne-bdt I added your implementation to the benchmark and test suites. I ran benchmarks with the parameters And an important note: there are failures when running WRAPPER_TRANSACTIONAL2_GRAPH, so not all benchmarks have been collected.
|
@sszuev From my understanding, your tests and benchmarks are tailored for graphs that remain stable when accessed concurrently by multiple threads. However, I'm puzzled about their practicality without transaction support. For instance, consider the following code snippet which lacks transactional context, leading to no guaranteed consistency between successive method calls to the graph:
In my branch, transactional, I've incorporated a transaction context around your benchmark and test code where it seemed fitting. The wrapper is coded like this:
Below are my benchmark results for GRAPH_WRAPPER_TRANSACTIONAL and TXN_GRAPH: They look pretty bad for GRAPH_WRAPPER_TRANSACTIONAL and do not match my experience in real life scenarios. |
Practical use is discussed above. I suggested considering adding a new functionality to Jena - a concurrent graph. It is needed for ONT-API. Additionally, I have provided the answer from ChatGPT about when a non-transactional concurrent graph can be used. In short, the relationship between a concurrent and a transactional graph is similar to the relationship between a JDK's
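This code is already present in the It is made via Kotlin's extension functions:

```kotlin
// Excerpt: startWrite()/endWrite() and createNew() are not shown here.
fun <X> Graph.transactionWrite(action: Graph.() -> X): X {
    try {
        startWrite()
        return action()
    } finally {
        endWrite()
    }
}

// Begins a READ transaction only when the graph is Transactional.
private fun Graph.startRead() {
    if (this is Transactional) {
        this.begin(TxnType.READ)
    }
}

// Ends the transaction only when the graph is Transactional.
private fun Graph.endRead() {
    if (this is Transactional) {
        this.end()
    }
}

// Copies all triples of `source` into a new graph within one write transaction.
fun createFrom(source: Graph): Graph {
    val res = createNew()
    res.transactionWrite {
        source.find().forEach { res.add(it) }
    }
    return res
}
```
|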
@sszuev The idea of treating your thread-safe graphs like concurrent collections has also inspired me for my current projects at work. I want to cache real-time measurements in graphs. Maybe I don't need transaction safety in this scenario and then hopefully your graphs and benchmarks will be helpful. Thanks again! |
…rentModificationException (see apache#1992, apache#1961)
Version
4.x.x
Question
In ONT-API we need a thread-safe graph.
For this purpose, I created a separate simple library: https://github.com/sszuev/concurrent-rdf-graph
It contains SynchronizedGraph & ReadWriteLockingGraph.
It would be convenient to have only Jena in dependencies (instead of Jena + this library).
So, the question to Apache Jena community: do we need such kinds of Graphs in Jena?
Even if we don't need them, I think it is good to have this issue for the record.