-
Welcome @suvalaki! Please provide more context. a) Which scraping code are you looking at? From my understanding, we have a legacy scraper that was built on Selenium, but I don't believe it is in use by default at the moment. Instead, have a look here at:
From my understanding, we currently use Langchain's WebBaseLoader by default. Are the Langchain loaders the underlying methods that you need to refactor? Please feel free to flesh out the use cases you need to handle and which methods and classes you would need to refactor to support them. If you feel there is room for improvement, we are always open to exploring PRs with new ideas. As they say in the movie Tron: "The thing about perfection is that it's unknowable." Therefore, software always has the potential to improve.
-
OK, I will make an effort to comment on where I think things could be improved. You're welcome to tell me I'm off my rocker here. This is obviously an incredible project (even though I'm a little frustrated this afternoon). https://github.com/assafelovic/gpt-researcher/blob/master/gpt_researcher/scraper/scraper.py#L26 Specifically, the session here is hard-coded. That makes it difficult to manage the sessions of either the bs4 or the Langchain variants of the getters. Furthermore, because of this we can't monkey patch requests. This scraper map is also an issue: https://github.com/assafelovic/gpt-researcher/blob/master/gpt_researcher/scraper/scraper.py#L73
You basically need to vendor ALL of the code in agent and point the one method elsewhere. Recommendations.
And if I were writing this myself, I might even have used composition instead of plain methods, creating instances of the member objects during GPTResearcher initialisation so that those mechanisms could be swapped out as strategies.
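To illustrate the composition idea, here is a minimal sketch. It is not the project's actual code: `PageScraper`, `ScraperRegistry`, `Researcher`, and `EchoScraper` are hypothetical names made up for the example, and the toy scraper does no network I/O. The point is only that the name-to-scraper map lives in an object handed to the agent at construction, so a caller can swap one entry without vendoring the whole class.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


class PageScraper(Protocol):
    """Hypothetical interface: anything that can turn a URL into text."""

    def scrape(self, url: str) -> str: ...


# A factory builds a fresh scraper; the registry maps a name to a factory.
ScraperFactory = Callable[[], PageScraper]


@dataclass
class ScraperRegistry:
    """Replaceable name -> factory map, instead of a dict fixed inside a method."""

    factories: Dict[str, ScraperFactory] = field(default_factory=dict)

    def register(self, name: str, factory: ScraperFactory) -> None:
        # Registering an existing name overrides it, which is the extension point.
        self.factories[name] = factory

    def create(self, name: str) -> PageScraper:
        try:
            return self.factories[name]()
        except KeyError:
            raise ValueError(f"unknown scraper: {name!r}") from None


@dataclass
class Researcher:
    """Stand-in for the agent: collaborators are passed in, not built internally."""

    registry: ScraperRegistry
    scraper_name: str = "default"

    def collect(self, urls: List[str]) -> List[str]:
        scraper = self.registry.create(self.scraper_name)
        return [scraper.scrape(url) for url in urls]


class EchoScraper:
    """Toy scraper used only to show the wiring."""

    def scrape(self, url: str) -> str:
        return f"content of {url}"


registry = ScraperRegistry()
registry.register("default", EchoScraper)
researcher = Researcher(registry=registry)
pages = researcher.collect(["https://example.com/a"])
```

Swapping behaviour then means calling `registry.register("default", MyScraper)` before building the researcher, rather than subclassing and re-pointing a method.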
-
I'm happy to contribute. I will probably need to find time in the back half of this week. I should have a reasonable idea of what needs to be done, since I've been vendoring this today. I'm not on the Discord. I'll look to connect once I free up a bit more time.
-
@suvalaki thanks for this great feedback. Feedback like this is exactly why this is open source and a community project. Agree with the fallbacks, and would love your help and contribution around it. Seems like you're the expert for this improvement, wdyt?
-
Please tell me if I'm missing something about the code; I'm having real trouble passing a requests session through to the scraper.
In general:
To me all this seems really bad. I'm literally being forced to rebuild entire classes and swap out methods. Is the problem on my end?
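For reference, this is roughly what "passing a session through" could look like. It is a minimal sketch under assumptions, not the project's scraper: `InjectableScraper`, `HttpSession`, `FakeSession`, and `FakeResponse` are hypothetical names. The scraper depends only on an object with a `get()` method, so a `requests.Session` or a test double can be handed in by the caller instead of being created inside the class.

```python
from typing import Any, Dict, List, Protocol, Tuple


class HttpSession(Protocol):
    """All the scraper needs from a session: a requests-style get() method."""

    def get(self, url: str, **kwargs: Any) -> Any: ...


class InjectableScraper:
    """Hypothetical scraper that receives its session instead of hard-coding one."""

    def __init__(self, session: HttpSession, timeout: float = 4.0) -> None:
        self.session = session
        self.timeout = timeout

    def fetch(self, url: str) -> str:
        response = self.session.get(url, timeout=self.timeout)
        response.raise_for_status()  # surface HTTP errors to the caller
        return response.text


class FakeResponse:
    """Test double mimicking the two response members used above."""

    def __init__(self, text: str) -> None:
        self.text = text

    def raise_for_status(self) -> None:
        return None


class FakeSession:
    """Test double that records calls instead of touching the network."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def get(self, url: str, **kwargs: Any) -> FakeResponse:
        self.calls.append((url, kwargs))
        return FakeResponse(f"<html>{url}</html>")


session = FakeSession()
scraper = InjectableScraper(session, timeout=10.0)
html = scraper.fetch("https://example.com")
```

With the real library, the same constructor would take a `requests.Session()` that the caller has already configured with their own headers, proxies, or retry adapters.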