Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup #42

Open
timrobertson100 opened this issue Feb 28, 2018 · 16 comments
Open

Cleanup #42

timrobertson100 opened this issue Feb 28, 2018 · 16 comments

Comments

@timrobertson100
Copy link
Member

This project is a mess. There are comments about classes being removed in the near future and it is not clear how it is now meant to be used.

We should remove what can be and clean up. You get the sense it is overly complex for what should be a rather simple library

@cgendreau
Copy link
Contributor

I'm happy to help. But I would be happier if the comment was more constructive.

@timrobertson100
Copy link
Member Author

timrobertson100 commented Feb 28, 2018

Fair point - and thanks! I was going to tackle this, but here is what I'd suggest - if sensible, we can create separate issues, but perhaps you know which classes can simply be removed to slim it down? (or marked as @deprecated with a comment to be removed in 1.33.

  • All classes that are not meant to be used should be removed (e.g. those with "Warning: this class will probably be removed shortly"). This will remove the biggest amount of confusion.
  • It is not clear why there are dwc and dwca packages. Especially when the dwc package contains things like Extensions which are only relevant for DwC Archives. Suggest refactor into something intuitive - perhaps only having packages of utils, xml, read, write or something along those lines
  • The DarwinCoreRecord and DarwinCoreTaxon haven't been updated to follow the changes in the standard, so should probably be removed (they'll still exist in old versions of course)
  • The API returns String for all values seemingly ignoring the delimitedBy. That should be documented if it is simply ignored (which as far as I can tell it is)
  • ArchiveFactory is not deprecated but it is not clear when that should be used over DwcFiles. If not necessary and is to be removed then things like DownloadUtils can also be removed.
  • StarRecordIterator has a comment with a TODO suggesting this can provide incorrect results if the comparator is not provided, but no issue exists for it - it's not clear whether that is something to be concerned with not. Suggest creating an issue and removing the comment as it causes confusion.
  • All classes should have a Javadoc (e.g. NormalizedDwcArchive explaining what normalized means, IntSequenceGenerator which I suspect should be removed)

cgendreau pushed a commit that referenced this issue Mar 1, 2018
@cgendreau
Copy link
Contributor

The "tabular" package was initially created as a solution for Issue #26 but was later replaced by what DwcFiles provides. The reason why it was still there once DwcFiles was created is simply that the validator was still based on it. After review we decided to also use the DwcFiles in the validator to keep consistency with the crawler and the poor package remained here, alone.

@cgendreau
Copy link
Contributor

The initial reason why ArchiveFactory was not deprecated is that DwcFiles doesn't apply validation on the Archive (see validateArchive method in ArchiveFactory) so DwcFiles was not a direct replacement. But now, I would say it could be deprecated. What do you think @mdoering ?

@cgendreau
Copy link
Contributor

It is not clear why there are dwc and dwca packages.

The goal (no trace/issue about it sorry) was to move everything in dwc and remove dwca. I would suggest keep the dwc package since an object from org.gbif.read.Archive is not really clear but I would definitely rework the packages to remove io and dwca.

@mdoering
Copy link
Member

mdoering commented Mar 1, 2018

Supporting all the refactoring ideas. Just checked and my code still uses ArchiveFactory everywhere. Would be good to have just one factory, whichever it is I don't care. And validation could be a separate thing - if its needed at all

@timrobertson100
Copy link
Member Author

timrobertson100 commented Mar 2, 2018

@mdoering - would you prefer that we incorporate the refactor so that the ArchiveFactory API remains? I think that would be doable and might make adoption easier.

@cgendreau
Copy link
Contributor

you mean to keep ArchiveFactory or to move its functionalities to DwcFiles?

@timrobertson100
Copy link
Member Author

Was asking if he'd prefer to keep ArchiveFactory public methods replacing their functionality with that of the DwcFiles. I expect a lot of codebases might be using it.

@cgendreau
Copy link
Contributor

ok, but it's already like that see https://github.com/gbif/dwca-io/blob/master/src/main/java/org/gbif/dwca/io/ArchiveFactory.java#L64

Unless I misunderstood.

@mdoering
Copy link
Member

mdoering commented Mar 4, 2018

I reckon its simpler to keep ArchiveFactory instead of DwcFiles as its used a lot already and the main entry into the library. But its not a big deal for my codebase to switch to DwcFiles - whatever you guys prefer. But I also would like to see just one factory

@timrobertson100
Copy link
Member Author

timrobertson100 commented Mar 4, 2018 via email

@cgendreau
Copy link
Contributor

I would suggest to keep DwcFiles and deprecate ArchiveFactory (otherwise I would have not create DwcFiles initially).

Considering ArchiveDwcIterator and ArchiveIterator were deprecated in 1.32 deprecating ArchiveFactory seems consistent to me. We are talking around 20 instances to update outside this project it self.

@MattBlissett
Copy link
Member

Just an FYI:

In the last month, developers from institutions in Japan, France, Germany (not Markus), Sweden, Britain, Mexico, the Netherlands and a cloud build server have downloaded the artefact from us.

MattBlissett added a commit that referenced this issue Mar 20, 2018
Discussion at #42
@MattBlissett
Copy link
Member

@cgendreau, what was the thinking behind deprecating the implements Iterable<StarRecord> of Archive?

The old way:

Archive arch = ArchiveFactory.openArchive(new File("archive.zip"));
for (StarRecord rec : arch) {
  ...
}

seems nicer than having to create a NormalizedDwcArchive yourself.

Would it be nice if we could do this?

Archive arch = DwcFiles.fromLocation(Paths.get("archive.zip"));
for (StarRecord starRec : arch) {
  ...
}
for (Record coreRec : arch.getCore()) {
  ...
}
for (Record extRec : arch.getExtension(whatever)) {
  ...
}

@cgendreau
Copy link
Contributor

cgendreau commented Mar 26, 2018

The main reason behind NormalizedDwcArchive was to make it more explicit that it is not simply a matter of returning an iterator but it may block if sorting is required. I felt it was kind of strange to have a for blocking for 2 minutes because we need to sort a huge file internally so hiding the details was not necessarily helping the user. Also, we should only use a StarRecord iterator if we really need the extensions because there is an additional cost. Having Archive implementing Iterable<StarRecord> makes StarRecord iterator the default iterator.

In short, I believe that sorting all files (core +ext) is not really the task of an iterator and that it should only be done if it's really required.

MattBlissett added a commit that referenced this issue Mar 27, 2018
This is unlikely to happen, so I won't fix it or even detect it; just
document it.

Relevant to #42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants