Intermittent Headless Timeout Error on Non-Local Environments #491

Open
mamuchastegui opened this issue Jul 26, 2024 · 0 comments
Describe the bug
I get the following error:
"No HTML body content found, please try setting the 'headless' flag to False in the graph configuration. HTML content: Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to 'https://clarin.com/', waiting until 'load'."

This only happens in non-local environments, and it is intermittent: it first happened about a month ago and has now occurred again. The error appears when I run in headless mode, and it seems to be something that broke with the latest changes.
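
For triage, the error text can be split into the navigation timeout, the target URL, and the wait condition. The following is a minimal sketch using only the standard library; `parse_goto_timeout` is a hypothetical helper, not part of scrapegraphai or Playwright, and the pattern assumes the message keeps the exact wording quoted above. It makes one detail easy to spot: the reported timeout is 30000 ms, while the reproduction config sets "timeout" to 60000. Whether that config key is meant to control the navigation timeout is an assumption that has not been verified here.

```python
import re
from typing import Optional

# Hypothetical helper for triage; not part of scrapegraphai or Playwright.
# The pattern assumes the exact wording of the error quoted in this issue.
_GOTO_TIMEOUT = re.compile(
    r"Page\.goto: Timeout (?P<timeout_ms>\d+)ms exceeded\."
    r".*?navigating to '(?P<url>[^']+)'"
    r".*?waiting until '(?P<wait_until>[^']+)'",
    re.DOTALL,
)


def parse_goto_timeout(message: str) -> Optional[dict]:
    """Return the timeout, URL and wait condition, or None if not a goto timeout."""
    match = _GOTO_TIMEOUT.search(message)
    if match is None:
        return None
    return {
        "timeout_ms": int(match.group("timeout_ms")),
        "url": match.group("url"),
        "wait_until": match.group("wait_until"),
    }


# The error message exactly as reported above.
ERROR = (
    "No HTML body content found, please try setting the 'headless' flag to False "
    "in the graph configuration. HTML content: Error: Page.goto: Timeout 30000ms "
    "exceeded. Call log: navigating to 'https://clarin.com/', waiting until 'load'."
)

info = parse_goto_timeout(ERROR)
print(info)
```

Comparing `info["timeout_ms"]` with the configured value is a quick way to check whether the configured timeout reached the browser navigation call.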

To Reproduce
Domain: "https://clarin.com/"
Prompt: "List me 3 product images with their title, image_url and url."

import asyncio

from scrapegraphai.graphs import SmartScraperGraph

from app.infrastructure.config.env_config import env
from app.infrastructure.logging import logger

graph_config = {
    "llm": {
        "api_key": env.OPENAI_SCRAPEGRAPH_API_KEY,
        "model": "gpt-4o",
    },
    "headless": True,
    "verbose": False,
    "timeout": 60000,
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
}


async def scrapegraph_products(domain_url: str, prompt: str):
    logger.info(f"scrapegraph_products - domain: {domain_url}, prompt: {prompt}")
    scrape_graph = SmartScraperGraph(
        prompt=prompt,
        source=domain_url,
        config=graph_config,
    )

    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, scrape_graph.run)

    logger.info(f"scrapegraph_products result: {result}")
    return result
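
Because the failure is intermittent, one possible stopgap is to retry the blocking `scrape_graph.run` call with exponential backoff. This is only a sketch under assumptions: `run_with_retries` is a hypothetical helper, not part of scrapegraphai, and it assumes the timeout surfaces as a raised exception rather than as a returned value. It works around the symptom and does not address the underlying cause.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    run: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run() up to `attempts` times, doubling the delay after each failure.

    Hypothetical helper: assumes the failure is raised as an exception.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except Exception:
            # Re-raise the last failure so the caller still sees the real error.
            if attempt == attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In the function above, the executor call would become `await loop.run_in_executor(None, run_with_retries, scrape_graph.run)`, so the retries and their sleeps stay off the event loop thread.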

Expected results

{
    'products': [
        {
            'title': 'La ceremonia de apertura de los Juegos Olímpicos 2024 en las mejores fotos',
            'image_url': 'https://www.clarin.com/img/2024/07/26/KIQdkZVmE_290x290__1.jpg',
            'url': 'https://clarin.com/fotogalerias/ceremonia-apertura-juegos-olimpicos-2024-mejores-fotos-paris-2024-jjoo-2024-argentina_5_4CHQW7XKlw.html'
        },
        {
            'title': 'Santiago Lange y un emotivo reconocimiento previo a la ceremonia de apertura de los Juegos Olímpicos: fue relevo de la antorcha de atletas legendarios',
            'image_url': 'https://www.clarin.com/img/2024/07/26/rn-hNvDc4_600x290__2.jpg#1722010337269',
            'url': 'https://www.clarin.com/deportes/santiago-lange-emotivo-reconocimiento-previo-ceremonia-apertura-juegos-olimpicos-relevo-antorcha-atletas-legendarios_0_FWiwtIz4xh.html'
        },
        {
            'title': 'Francis Ford Coppola en problemas: un video besando a extras que estaban en topless de su filme Megalópolis',
            'image_url': 'https://www.clarin.com/img/2024/07/26/AcUPKcQmm_600x290__1.jpg',
            'url': 'https://www.clarin.com/espectaculos/francis-ford-coppola-problemas-video-besando-extras-topless-filme-megalopolis_0_EJSTNUX8bl.html'
        }
    ]
}

Additional context
Docker base image: python:3.10-slim
Deployed on Google App Engine
