questions about _split and _reduce #1

davidenitti · 2021-11-21T00:00:55Z

hi,
I'm trying to understand the code, and I have a few questions.
The output of the command doxer.py -t 3satoshi is:

0.0010055437039908227 -> 237lachesis_forum.txt                      
0.001015254105998512 -> 224GavinAndresen_forum.txt
224GavinAndresen

I don't understand the line corpus = corpus[:_split(_reduce(corpus))]
I don't get what _split and _reduce do.
Isn't corpus = sorted(corpus,reverse=False)[:args.reduce] enough to reduce the corpus with the given option?
Another question: where did you get the forum text? Did you remove the quotes? If someone replies to another person in the forum, the original message can be quoted, so the text of two people appears in the same message. Can this be a problem?
Thanks


goldmonkey21 · 2021-11-23T04:44:08Z

The _split() and _reduce() functions are an extra feature I added at the last minute to automate reducing the number of candidates. It is based on the scree plot image at https://github.com/goldmonkey21/doxer/blob/main/images/scree-plot-gavin-satoshi-2.png

What I was doing is finding a spot to cut off the list of candidates. With a scree plot this is easy to do by eye: you look for the spot where the line changes dramatically (e.g. in the graph the line changes dramatically after Gavin, Hal, and Lachesis). But I wanted to automate this because the doxer algorithm works better with a reduced candidate list.

0.0010055437039908227 -> 237lachesis_forum.txt                      
0.001015254105998512 -> 224GavinAndresen_forum.txt
224GavinAndresen

This is the result of an automated scree plot. It is a rough algorithm, but in this case it actually cut off the line after Gavin and Lachesis, leaving Hal and the other candidates out of the final Doxer algorithm. The _reduce() function reduces the list of delta numbers to the distances between consecutive delta numbers (i.e. to find the sudden change in line length in the scree plot). The _split() function finds the index in the delta results where you need to split the list. There is a bit of computation here because it has to work out where to cut the main delta list, given where the longest line in the scree plot was.

So, back to the scree plot image: the _reduce() function would get the distance between Gavin and Hal, which is quite long. Then it would get the distance between Hal and Lachesis, which is actually shorter. It keeps both of these lines because the second one is shorter than the first. Then it gets the line between Lachesis and Btchris and cuts the list there because that gap is so massive. The _split() function then loops over these distances and returns an integer index, so that you can go back to the original list of deltas (not the distances between deltas) and print the remaining candidates with print('{} -> {}'.format(x[0],x[1]))
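
To make the idea concrete, here is a minimal sketch of a gap-based cutoff. It is not the actual doxer code: the function names consecutive_gaps and cut_index are hypothetical, the scores are invented for illustration, and this simplified version just cuts at the single largest gap rather than reproducing the exact rule in _reduce() and _split().

```python
def consecutive_gaps(deltas):
    """Distances between neighbouring delta scores (input sorted ascending)."""
    return [b - a for a, b in zip(deltas, deltas[1:])]


def cut_index(gaps):
    """Number of candidates to keep: everything before the largest gap."""
    if not gaps:
        return 1  # zero or one candidate, nothing to cut
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    return largest + 1


# Invented scores for illustration only (not real doxer output), sorted ascending.
scores = [
    (0.0010, "candidate_a.txt"),
    (0.0011, "candidate_b.txt"),
    (0.0013, "candidate_c.txt"),
    (0.0030, "candidate_d.txt"),
    (0.0032, "candidate_e.txt"),
]

deltas = [score for score, _ in scores]
kept = scores[:cut_index(consecutive_gaps(deltas))]

for x in kept:
    print('{} -> {}'.format(x[0], x[1]))
```

The returned index refers to the original list of scores, not to the list of gaps, which is why one is added to the position of the largest gap.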

Isn't corpus = sorted(corpus,reverse=False)[:args.reduce] enough to reduce the corpus with the given option?

Good question. Yes, it originally was, and you could just type python doxer.py -r 3 -t 3satoshi. But at the last minute I added the automatic scree plot cutoff to get rid of the junk candidates and make the results more consistent.

Another question: where did you get the forum text?

I scraped the text from the Bitcointalk forum using the requests and Beautiful Soup libraries. I did remove quotes from the raw text, using Beautiful Soup to automatically find the quote tags and remove them. I can provide you with the original script so that you can scrape more texts if you want.
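
As a rough illustration of the quote removal (this is not the original scraping script), the sketch below strips quoted blocks from a post with Beautiful Soup. The function name strip_quotes and the sample HTML are hypothetical, and the class names quoteheader and quote are an assumption about the forum's markup that should be checked against the real pages.

```python
from bs4 import BeautifulSoup

# Assumed CSS classes for quoted blocks; verify against the actual forum HTML.
QUOTE_CLASSES = ("quoteheader", "quote")


def strip_quotes(post_html):
    """Return the text of a post with quoted blocks removed."""
    soup = BeautifulSoup(post_html, "html.parser")
    for css_class in QUOTE_CLASSES:
        tag = soup.find("div", class_=css_class)
        while tag is not None:
            tag.decompose()  # drop the quoted block and everything inside it
            tag = soup.find("div", class_=css_class)
    return soup.get_text(" ", strip=True)


# Hypothetical post: one quoted message followed by the author's own reply.
sample = (
    '<div class="post">'
    '<div class="quoteheader">Quote from: someone</div>'
    '<div class="quote">Original message text.</div>'
    'My own reply.'
    '</div>'
)

print(strip_quotes(sample))
```

Removing the quoted blocks before computing any statistics keeps one author's words from being counted under another author's name.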

Also, I have a new original algorithm I am working on right now and hopefully will release it soon. I'll put an announcement for it on the readme.md page as soon as it is done.

davidenitti changed the title ~~questions about the delta about _split and _reduce~~ questions about _split and _reduce Nov 21, 2021
