Unable to Crawl JS Dependent Sites #3

Texan1835 · 2020-06-04T19:39:14Z

I ran this scraper exactly as written; the only change was pointing the log path to a .txt file in a Windows folder. It captures about half the email addresses on a given webpage, but never captures phone numbers. I'm running it against an entire website, not just a single webpage, but the problem also occurs when I run it against a single webpage that lists multiple phone numbers.
Environment: Windows 10, Python 3.8, PyCharm.
Please note: I'm a newbie to Python, so it's possible the error is on my end.

Edit: Ran scraper against this link because it has lots of phone/email: https://www.hamradio.com/contact.cfm

Result:

`Crawling https://www.hamradio.com/contact.cfm

Emails:

Phone Numbers:

Process finished with exit code 0`

z7r1k3 · 2020-06-04T20:32:29Z

Are those phone numbers linked? That is, if you view the source, does it have an href="tel:1234567890"?

Currently, only linked phones/emails are supported, but I do plan to eventually add support to search the entire page for anything that looks like a phone/email, linked or not.
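To illustrate what "linked" means here, the following is a minimal sketch of link-only extraction using Python's standard `html.parser`. It is not the crawler's actual code; `LinkedContactParser`, `extract_linked_contacts`, and the sample HTML are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class LinkedContactParser(HTMLParser):
    """Collects targets of tel: and mailto: links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.phones = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        scheme, _, rest = href.partition(":")
        scheme = scheme.lower()
        if scheme == "tel":
            self.phones.append(unquote(rest))
        elif scheme == "mailto":
            # Drop any ?subject=... query that follows the address.
            self.emails.append(unquote(rest.split("?", 1)[0]))


def extract_linked_contacts(html):
    parser = LinkedContactParser()
    parser.feed(html)
    return parser.phones, parser.emails


# Made-up sample: two linked contacts plus one plaintext phone number.
sample = (
    '<a href="tel:1234567890">Call us</a>'
    '<a href="mailto:sales@example.com?subject=Hi">Email</a>'
    "<p>Phone: 713-533-7373</p>"
)
phones, emails = extract_linked_contacts(sample)
print(phones)
print(emails)
```

In this sketch the plaintext number inside the `<p>` tag is never collected, because only `href` values on `<a>` tags are inspected.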

Texan1835 · 2020-06-04T20:37:33Z

It is not formatted like that. Uses br tags.

Code for phone looks like this on the webpage:
`

Phone: 713-533-7373

Toll Free: 800-854-6046

`

Email code:

`
[email protected]

`

z7r1k3 · 2020-06-05T00:04:20Z

Unfortunately the crawler doesn't support scraping for plaintext phones/emails yet, although that is on the to-do list. For now it has to be an actual tel or mailto link.
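For reference, a plaintext search could look roughly like the sketch below. This is only an illustration of the idea, not the planned implementation: the patterns assume North American style numbers like the ones quoted above, and `PHONE_RE`, `EMAIL_RE`, `find_plaintext_contacts`, and the `info@example.com` address are all hypothetical.

```python
import re

# Assumption: 10-digit numbers written as 713-533-7373, 713.533.7373,
# 713 533 7373, or (713) 533-7373. Other formats are not covered.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)
# Deliberately loose email pattern; it will miss some valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_plaintext_contacts(text):
    """Return (phones, emails) found anywhere in the text, linked or not."""
    return PHONE_RE.findall(text), EMAIL_RE.findall(text)


page = "Phone: 713-533-7373<br>Toll Free: 800-854-6046<br>info@example.com"
plain_phones, plain_emails = find_plaintext_contacts(page)
print(plain_phones)
print(plain_emails)
```

A pattern this loose can produce false positives (for example, order numbers that happen to look like phone numbers), so results from a real page would need review.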

As for that .cfm link, since .cfm isn't added to the whitelist, it's treating it as an unsupported filetype. I'll go ahead and add it, but you should be able to put that in as the original scraping URL as a workaround. Is that what you tried and did it still not work?
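As a rough picture of the filetype check described above, here is a minimal sketch of an extension whitelist. The names `ALLOWED_EXTENSIONS` and `is_crawlable` and the exact set of extensions are assumptions for illustration; the crawler's real list and logic may differ.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical whitelist; "" covers URLs with no file extension at all.
ALLOWED_EXTENSIONS = {"", ".html", ".htm", ".php", ".asp", ".aspx", ".cfm"}


def is_crawlable(url):
    """Return True if the URL path ends in a whitelisted extension."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in ALLOWED_EXTENSIONS


print(is_crawlable("https://www.hamradio.com/contact.cfm"))
print(is_crawlable("https://example.com/brochure.pdf"))
```

With `.cfm` in the set, the contact page passes the check; without it, a check like this would skip the page as an unsupported filetype.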

z7r1k3 · 2020-06-05T01:46:52Z

After debugging, the crawler is unable to view the webpage because it requires JavaScript.

As such it would appear this site (and any site like it) is unsupported. I may add a fix for this in the future if it becomes common enough, but I'll need to deep dive it a bit.

This is all the HTML the crawler gets to see:

<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='eD0nMG1TYicuc3Vic3RyKDMsIDEpICsgJycgKycnKyJlIi5zbGljZSgwLDEpICsgIjRzdWN1ciIuY2hhckF0KDApKyJjbCIuY2hhckF0KDApICsgICcnICsgCiIwc3VjdXIiLmNoYXJBdCgwKSsnYScgKyAgICcnICsnMmEnLnNsaWNlKDEsMikrJzMnICsgICJmIiArICIiICsiM3N1Ii5zbGljZSgwLDEpICsgImNzdWN1ciIuY2hhckF0KDApKyIiICsndUc5Jy5jaGFyQXQoMikrJz1mJy5zbGljZSgxLDIpK1N0cmluZy5mcm9tQ2hhckNvZGUoMHgzNCkgKyAnNicgKyAgJzAnICsgICAnJyArJzQnICsgICI2ayIuY2hhckF0KDApICsgICcnICsgCiI5dyIuY2hhckF0KDApICsgIjgiICsgIjIiICsgIiIgKyczJyArICBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzUpICsgIjgiLnNsaWNlKDAsMSkgKyAgJycgKyAKIjRzdWN1ciIuY2hhckF0KDApKyAnJyArJzAnICsgICI1bSIuY2hhckF0KDApICsgICcnICsgCiJhc3UiLnNsaWNlKDAsMSkgKyAiIiArU3RyaW5nLmZyb21DaGFyQ29kZSgweDM0KSArICdWeD4wJy5zdWJzdHIoMywgMSkgKyAnJyArImZzdWN1ciIuY2hhckF0KDApKyJjIiArICAnJyArJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjdXJpJy5jaGFyQXQoMCkgKyAndXMnLmNoYXJBdCgwKSsnYycrJ3VzdScuY2hhckF0KDApICsnc3VjdXJyJy5jaGFyQXQoNSkgKyAnaScrJ19zdScuY2hhckF0KDApICsnc3VjdXJpYycuY2hhckF0KDYpKydsJy5jaGFyQXQoMCkrJ29zdWN1Jy5jaGFyQXQoMCkgICsnc3UnLmNoYXJBdCgxKSsnc3VjdXJkJy5jaGFyQXQoNSkgKyAncCcuY2hhckF0KDApKydyJysnJysnc3VjdXJpbycuY2hhckF0KDYpKyd4c3VjdScuY2hhckF0KDApICArJ3lzJy5jaGFyQXQoMCkrJ18nKyd1JysndScrJ2knKydkJysnX3N1Y3VyJy5jaGFyQXQoMCkrICdiJysnc3VjdXJmJy5jaGFyQXQoNSkgKyAnOScrJzlzJy5jaGFyQXQoMCkrJ2JzdWN1cmknLmNoYXJBdCgwKSArICdlJy5jaGFyQXQoMCkrJ2NzdScuY2hhckF0KDApICsnNnN1YycuY2hhckF0KDApKyAnZnMnLmNoYXJBdCgwKSsiPSIgKyB4ICsgJztwYXRoPS87bWF4LWFnZT04NjQwMCc7IGxvY2F0aW9uLnJlbG9hZCgpOw==';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>
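A crawler could at least detect this situation and report it instead of printing empty results. Below is a minimal heuristic sketch, not something the crawler currently does: `looks_js_dependent` is a hypothetical function, and the threshold of 20 visible words is an arbitrary assumption. The test page is a shortened stand-in for the response above, with the script body replaced.

```python
import re

NOSCRIPT_RE = re.compile(
    r"<noscript\b[^>]*>(.*?)</noscript>", re.IGNORECASE | re.DOTALL
)
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def looks_js_dependent(html, max_visible_words=20):
    """Guess whether a page is only a JavaScript challenge or redirect stub.

    Heuristic: the <noscript> fallback mentions JavaScript, and almost no
    text remains once scripts, noscript blocks, and tags are removed.
    """
    noscript_text = " ".join(NOSCRIPT_RE.findall(html)).lower()
    mentions_js = "javascript" in noscript_text
    remainder = NOSCRIPT_RE.sub("", SCRIPT_RE.sub("", html))
    visible_words = TAG_RE.sub(" ", remainder).split()
    return mentions_js and len(visible_words) < max_visible_words


challenge_page = (
    "<html><title>You are being redirected...</title>"
    "<noscript>Javascript is required. Please enable javascript before "
    "you are allowed to see this page.</noscript>"
    "<script>var s={};</script></html>"
)
normal_page = (
    "<html><body><noscript>Enable JavaScript for maps.</noscript><p>"
    + "word " * 50
    + "</p></body></html>"
)
print(looks_js_dependent(challenge_page))
print(looks_js_dependent(normal_page))
```

Detection only tells the user why nothing was found. Actually reading such a page would require executing its JavaScript, for example through a headless browser, which is a separate and larger change.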

z7r1k3 · 2020-06-05T01:58:17Z

Reopening since the OP stated that it successfully scraped emails from another site, but not the phone numbers.

I'm assuming it's because the phone numbers on the other site (not hamradio) were in plaintext, but I will give the OP a chance to respond on the off-chance there's something else going on here.

OP, can you provide the URL that "Captures about half the email addresses on a given webpage, but never captures phone numbers"? Or at least a snippet of the source code?

z7r1k3 · 2020-06-11T04:52:08Z

No reply. Closing as everything presented in this issue is not supported.

z7r1k3 · 2020-08-31T02:47:02Z

Reopening: since this is a website with proper links, etc. in the HTML, it should be supported.

There is no timeline for fixing this issue, but it is officially on the agenda.

z7r1k3 closed this as completed Jun 5, 2020

z7r1k3 reopened this Jun 5, 2020

z7r1k3 closed this as completed Jun 5, 2020

z7r1k3 added the not supported label Jun 5, 2020

z7r1k3 reopened this Jun 5, 2020

z7r1k3 closed this as completed Jun 11, 2020

z7r1k3 reopened this Aug 31, 2020

z7r1k3 added the bug label and removed the not supported label Aug 31, 2020

z7r1k3 self-assigned this Sep 3, 2020

z7r1k3 closed this as completed Sep 3, 2020

z7r1k3 reopened this Sep 3, 2020

z7r1k3 changed the title ~~Phone numbers not pulling into file~~ Unable to Crawl JS-Dependent Sites Sep 4, 2020

z7r1k3 changed the title ~~Unable to Crawl JS-Dependent Sites~~ Unable to Crawl JS Dependent Sites Sep 4, 2020

z7r1k3 added the enhancement label and removed the bug label Mar 24, 2021
