questions about _split and _reduce #1
The _split() and _reduce() functions are an extra feature I added at the last minute to automate reducing the number of candidates. They are based on the scree plot image at https://github.com/goldmonkey21/doxer/blob/main/images/scree-plot-gavin-satoshi-2.png. The idea is to find a spot at which to cut off the list of candidates. By eye this is easy: you look for the point where the line changes dramatically (e.g. in the graph the line changes dramatically after Gavin, Hal, and Lachesis). I wanted to automate this because the doxer algorithm works better with a reduced candidate list.
This is the result of an automated scree plot. It is a rough algorithm, but in this case it actually cut off the line after Gavin and Lachesis, leaving Hal and the other candidates out of the final Doxer algorithm.

The _reduce() function reduces the list of delta numbers to the distances between neighbouring delta numbers (i.e. it finds the sudden change in the length of the scree plot's line segments). The _split() function finds the index in the delta results where you need to split the list. There is a bit of computation here because, given where the longest line in the scree plot is, it has to work out where to cut the main delta list.

So, back to the scree plot image: _reduce() would get the distance between Gavin and Hal, which is quite long, then the distance between Hal and Lachesis, which is actually shorter. It keeps both of these lines because the second one is shorter than the first. Then it gets the line between Lachesis and Btchris and drops it there because it is so massive. The _split() function then loops over these distances and returns an integer index, so that you can go back to the original list of deltas (not the distances between deltas) and return the output.
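To make the idea concrete, here is a minimal sketch of a gap-based cutoff. It is not the code from doxer.py: the names `gaps` and `cut_index` are hypothetical, the scores are made up, and it uses a simplified "cut at the single largest gap" rule, which may differ from what _reduce() and _split() actually do.

```python
def gaps(deltas):
    """Distance between each pair of neighbouring delta scores (hypothetical helper)."""
    return [b - a for a, b in zip(deltas, deltas[1:])]


def cut_index(deltas):
    """Index at which to slice `deltas` so everything after the largest gap is dropped."""
    distances = gaps(deltas)
    if not distances:
        # Zero or one score: there is no gap, so keep the whole list.
        return len(deltas)
    # Position of the largest jump between neighbours.
    largest = max(range(len(distances)), key=distances.__getitem__)
    # Gap i sits between deltas[i] and deltas[i + 1], so keep deltas[:i + 1].
    return largest + 1


# Made-up delta scores, sorted ascending (lower = closer match).
scores = sorted([0.10, 0.45, 0.60, 1.90, 2.00])
kept = scores[:cut_index(scores)]
print(kept)
```

With these invented numbers the largest jump is between the third and fourth score, so the slice keeps the first three. Unlike a fixed `[:args.reduce]` slice, the number of candidates kept depends on the shape of the scores rather than on a value chosen in advance.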
Good question. Yes it originally was and you could just type
I scraped the text from the Bitcointalk forum using the requests and Beautiful Soup libraries. I did remove quotes from the raw text, using Beautiful Soup to automatically find the quote tags and remove them. I can provide you with the original script so that you can scrape more texts if you want. Also, I have a new original algorithm I am working on right now and will hopefully release it soon. I'll put an announcement for it on the readme.md page as soon as it is done.
hi,
I'm trying to understand the code, and I have a few questions:
if the output for the command doxer.py -t 3satoshi is
I don't understand the line
corpus = corpus[:_split(_reduce(corpus))]
I don't get what _split() and _reduce() do.
isn't
corpus = sorted(corpus,reverse=False)[:args.reduce]
enough to reduce the corpus with the given option?
Another question: where did you take the forum text from? Did you remove the quotes? If someone replies to another person in the forum, the original message can be quoted, so the text of two people appears in the same message. Can this be a problem?
thanks