Automatically generate topics and keywords #47

Open
RichardLitt opened this issue Oct 31, 2017 · 4 comments

Comments

@RichardLitt
Member

RichardLitt commented Oct 31, 2017

This will involve a couple of things. First, parsing the README. Second, finding the Description or Background section. Then, running either topic extraction or named-entity recognition (NER) on that text, to see whether topics can be suggested automatically for the README.

For now, noun phrases from the description may do the trick for suggestions. A test database of repositories and their topics would help greatly, however.
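The steps described above (parse the README, find the Description or Background section, pull out candidate phrases) could be sketched roughly as follows. This is only an illustration under stated assumptions: the README is Markdown with `#`-style headings, the function names and the tiny stopword list are made up for this example, and runs of non-stopword tokens stand in for real noun-phrase chunking, which would need a part-of-speech tagger.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "for", "on", "with",
    "is", "are", "this", "that", "it", "as", "by", "be", "can", "from",
}

# Matches ATX headings such as "## Background".
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def extract_section(readme: str, names=("description", "background")) -> str:
    """Return the text under the first heading whose title is in `names`."""
    collected = []
    level = None  # heading depth of the section being captured
    for line in readme.splitlines():
        m = HEADING.match(line)
        if m:
            depth, title = len(m.group(1)), m.group(2).strip().lower()
            if level is None:
                if title in names:
                    level = depth
                continue
            if depth <= level:
                break  # a heading at the same or a higher level ends the section
        if level is not None:
            collected.append(line)
    return "\n".join(collected).strip()


def candidate_phrases(text: str, max_len: int = 2) -> Counter:
    """Count runs of up to `max_len` consecutive non-stopword tokens."""
    counts = Counter()
    for sentence in re.split(r"[.!?\n]+", text):
        tokens = re.findall(r"[a-z][a-z0-9+#-]*", sentence.lower())
        run = []
        for tok in tokens + [""]:  # empty sentinel flushes the final run
            if tok and tok not in STOPWORDS:
                run.append(tok)
                continue
            for n in range(1, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[" ".join(run[i:i + n])] += 1
            run = []
    return counts
```

Calling `candidate_phrases(extract_section(readme_text)).most_common(10)` would give a ranked list of suggestions. Note that the heading regex does not skip fenced code blocks, so a `# comment` line inside a fence would be misread as a heading.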

@RichardLitt
Member Author

It may be possible to bootstrap a learning corpus with this list of topics: https://github.com/github/explore.

@andrew

andrew commented Nov 7, 2017

A low tech way for projects published to a package manager that supports keywords would be to pull the existing ones from the keywords or tags fields.

I did experiment with pulling interesting words from READMEs and descriptions in the Libraries.io codebase using a Ruby library called highscore, but I removed it a while back because the results weren't great and it was too slow to run as part of the critical path inside the Rails app. The main code was here: https://github.com/librariesio/libraries.io/blob/7a15048fe7135052dc3ac9383d13833b5cb1f85b/app/models/readme.rb#L75-L79
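For reference, the low-tech approach of reusing keywords that a project has already published could look something like the sketch below. It assumes a JSON manifest in the style of npm's `package.json`, with a `keywords` (or, hypothetically, `tags`) field; the function name is invented for this example, and other package managers store this information differently.

```python
import json


def manifest_keywords(manifest_text: str) -> list[str]:
    """Return normalized, de-duplicated keywords from a JSON manifest."""
    data = json.loads(manifest_text)
    # Prefer "keywords"; fall back to a "tags" field if one is present.
    raw = data.get("keywords") or data.get("tags") or []
    if isinstance(raw, str):
        # Tolerate a single comma- or space-separated string.
        raw = raw.replace(",", " ").split()
    cleaned = (str(k).strip().lower() for k in raw)
    # dict.fromkeys keeps the first occurrence and preserves order.
    return list(dict.fromkeys(k for k in cleaned if k))
```

For example, `manifest_keywords(open("package.json").read())` would return the project's declared keywords in lowercase, ready to be offered as topic suggestions.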

@RichardLitt
Member Author

> A low tech way for projects published to a package manager that supports keywords would be to pull the existing ones from the keywords or tags fields.

Yeah, I already do that for projects which have manifests. I'm trying to think of a better way to extract. I think not using the entire README - just the description and background sections - should help.

I'm going to make a package now to automatically cross-check with topics from github/explore. That might be a workable solution while GitHub doesn't yet have an API for suggesting topics.
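A cross-check of that kind might be sketched as below. This is not the package mentioned above, just a minimal illustration: it assumes a local checkout of github/explore in which each topic is a sub-directory of `topics/`, and all function names here are hypothetical. Candidate phrases, however they were extracted, are turned into hyphenated lowercase slugs and kept only if they match a known topic.

```python
import re
from pathlib import Path


def load_topics(explore_dir: str) -> set[str]:
    """Collect topic names from the sub-directories of `<explore_dir>/topics`.

    Assumes a local clone of github/explore with one directory per topic.
    """
    root = Path(explore_dir) / "topics"
    return {p.name for p in root.iterdir() if p.is_dir()}


def to_slug(phrase: str) -> str:
    """Lowercase a phrase and join its words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", phrase.lower()).strip("-")


def suggest_topics(candidates, known_topics) -> list[str]:
    """Keep candidates whose slug is a known topic, in order, without repeats."""
    seen = set()
    suggestions = []
    for phrase in candidates:
        slug = to_slug(phrase)
        if slug in known_topics and slug not in seen:
            seen.add(slug)
            suggestions.append(slug)
    return suggestions
```

With `known = load_topics("path/to/explore")`, a call such as `suggest_topics(["Markdown parser", "Command Line"], known)` would return only the phrases that correspond to existing topics. The slug rule is deliberately crude: it drops punctuation, so a name like "C++" would not map to its real topic slug without a lookup table of aliases.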

Thanks for the help! Slowness isn't an issue for me; I think this will be pretty fast.

@RichardLitt
Member Author

I've started work on this, here: Katahdin.
