
Alexa Top 1M is no longer. Replacement? #1

Open

igrigorik opened this issue Nov 21, 2016 · 19 comments

Comments

@igrigorik (Contributor) commented Nov 21, 2016

https://twitter.com/Alexa_Support/status/800755671784308736

[image: screenshot of the tweet linked above]

Alternatives

Quantcast Top 1M US sites

Directly Measured: These properties allow Quantcast to monitor their traffic with Quantcast Measurement. Therefore, traffic numbers for these sites are highly accurate and not estimated. Learn more.

We collect directly measured data from the millions of web destinations and mobile apps controlled by Quantified publishers. All the data we collect is anonymous and contains no personally identifiable information (PII).

Majestic Million CSV

The Majestic Million is a list of the top 1 million websites in the world, ordered by the number of referring subnets. A subnet is a bit complex – but to a layman it is basically anything within an IP range, ignoring the last three digits of the IP number.

ahrefs

Another crawl-based service. I don't see any free ranking dumps, though.

SimilarWeb Top sites

Top 50 for free; a paid account is required to see more.


Others we could or should consider?

@igrigorik (Contributor, Author)

Cisco Umbrella 1 Million

With that, we are very excited to announce the Cisco Umbrella 1 Million — a free list of the top 1 million most popular domains. This project came from our most recent hack-a-thon where we had more than 300 participants across 3 different countries hack for 24 hours. Hack-a-thons are an important piece of our engineering team’s culture and showcases their passion to build. On the heels of the announcement that the Alexa 1 Million list was not going to be available for free anymore, the idea was that we have the data and the means to provide an alternative.

The list gets refreshed every 24 hours and will be accessible from the same location (URL).

Data: http://s3-us-west-1.amazonaws.com/umbrella-static/index.html

Looks very promising!

@RByers commented May 21, 2017

What's the current status here? @pmeenan are we just using a stale Alexa list?

@igrigorik (Contributor, Author)

@pmeenan yes. afaik Amazon is not updating the rankings any more, but the list is still there (as a download).

@hsbahri commented Aug 29, 2017

Hi,
Is there a way I can get Alexa ranking list from 1 million to 10 million?

@pmeenan (Member) commented Aug 29, 2017

@hsbahri You're asking in the wrong place; you need to ask Alexa (or Amazon, which owns it). We just use the public 1M list they used to provide.

@rviscomi (Member) commented Dec 18, 2017

Google recently released the Chrome User Experience Report, which includes 908K distinct domains from 1.2M distinct origins. Looking at the diff between these domains and Alexa's, CrUX seems to be higher quality in some ways. For example, it excludes the t.co link shortener and microsoftonline.com, which is just the domain for logged-in users. Neither of these domains is useful for HA to crawl. It's also updated monthly. One big drawback is that CrUX isn't ranked, so we wouldn't necessarily know the relative popularity of domains. Something to consider, though.
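For sizing the CrUX table and its overlap with Alexa, queries along these lines can be used (a sketch only; it assumes the `chrome-ux-report.all.201711` and `httparchive.urls.20170315` tables referenced in this thread and a `domain` column in the Alexa table, and is not necessarily what was actually run):

#standardSQL
-- Distinct CrUX hosts (with subdomains) and registrable domains (without).
SELECT
  COUNT(DISTINCT NET.HOST(origin)) AS distinct_hosts,
  COUNT(DISTINCT NET.REG_DOMAIN(origin)) AS distinct_reg_domains
FROM
  `chrome-ux-report.all.201711`

#standardSQL
-- CrUX registrable domains that are absent from the Alexa list (swap the tables for the reverse diff).
SELECT DISTINCT
  NET.REG_DOMAIN(origin) AS domain
FROM
  `chrome-ux-report.all.201711`
WHERE
  NET.REG_DOMAIN(origin) NOT IN (
    SELECT domain FROM `httparchive.urls.20170315`)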

@rviscomi (Member)

Correction: The 908K number was for distinct domains excluding subdomains. The number of distinct domains including subdomains is more like 1.2M.

I've also been exploring a newer version of the Alexa list than the httparchive.urls.20170315 table on BigQuery. That list (see 20171221) does include subdomains, but that seems to lower the bar for uninteresting domains like client4.google.com, which may be popular as a CDN but not as a user-facing website. Using that list we only get ~135K unique domains that we can consider high-quality (CrUX-worthy).

One goal is to prune out the URLs in the current crawls that are low quality, e.g. not in CrUX. According to the following query, that would be about half of our dataset :-/

#standardSQL
-- HTTP Archive page URLs whose host also appears as a CrUX origin host.
SELECT
  url
FROM
  `httparchive.runs.2017_12_01_pages`
WHERE
  NET.HOST(url) IN (
    SELECT
      DISTINCT NET.HOST(origin)
    FROM
      `chrome-ux-report.all.201711`)

217K URLs out of 470K (46%) are in CrUX.
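A counting variant of the same query would produce the percentage directly (same assumptions as above; a sketch, not necessarily the query actually run):

#standardSQL
-- Share of HTTP Archive pages whose host appears as a CrUX origin host.
SELECT
  COUNTIF(crux.host IS NOT NULL) AS pages_in_crux,
  COUNT(*) AS total_pages,
  ROUND(100 * COUNTIF(crux.host IS NOT NULL) / COUNT(*), 1) AS pct_in_crux
FROM (
  SELECT NET.HOST(url) AS host
  FROM `httparchive.runs.2017_12_01_pages`) AS pages
LEFT JOIN (
  SELECT DISTINCT NET.HOST(origin) AS host
  FROM `chrome-ux-report.all.201711`) AS crux
ON pages.host = crux.host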

@pmeenan (Member) commented Dec 21, 2017

Instead of joining the 470k that we currently pull, what does the intersection of the full 1M look like?

@rviscomi (Member)

Going by the latest Alexa list, much worse actually.

#standardSQL
-- Alexa 20171221 domains that also appear as CrUX origin hosts, with their ranks.
SELECT
  rank,
  domain
FROM
  `httparchive.urls.20171221`
WHERE
  domain IN (
    SELECT
      domain
    FROM (
      SELECT
        domain
      FROM
        `httparchive.urls.20171221`) AS alexa
    JOIN (
      SELECT
        DISTINCT NET.HOST(origin) AS domain
      FROM
        `chrome-ux-report.all.201711`) AS crux
    USING (domain))
ORDER BY
  rank

134,768 domains.

If we use the 20170315 Alexa list instead, we get a much nicer number of 490,349 domains. However, this older list doesn't include subdomains, so we're missing out on a lot of important sites.

@pmeenan (Member) commented Dec 21, 2017

I kind of like the idea of using the intersection of the older list and the CrUX report. We were using the top 500k from the older list anyway, so joining with CrUX would let us filter it to just the actual domains that serve pages (and would put us in the same ballpark).

Sounds like the newer domain lists are pretty much useless.

@rviscomi (Member)

If they increased the granularity and extended the list to ~5M I think it'd be very valuable. But instead they watered it down with 350K uninteresting CDN variations. :(

Using the older list SGTM. We would need to figure out the issue of taking an Alexa domain (with no subdomain or protocol) and picking the "canonical" origin from CrUX. In some cases this could be "prefer the origin with www and/or https" but it gets weird in other cases, like "for live.com prefer https://outlook.live.com".

One brutish approach could be a test crawl on the domains themselves, updating the list with wherever the initial URL redirects to.
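For the 1:1 domain-to-origin mapping, a naive heuristic could be sketched in SQL like this (the scoring is entirely hypothetical and assumes the 20170315 table has a `domain` column; cases like live.com → https://outlook.live.com would still need the crawl-and-follow-redirects approach):

#standardSQL
-- For each Alexa domain, pick one CrUX origin by a naive preference order:
-- prefer https, then a bare or "www." host, then the shortest origin (hypothetical heuristic).
SELECT
  domain,
  ARRAY_AGG(origin
    ORDER BY
      IF(STARTS_WITH(origin, 'https://'), 0, 1),
      IF(NET.HOST(origin) IN (domain, CONCAT('www.', domain)), 0, 1),
      LENGTH(origin)
    LIMIT 1)[OFFSET(0)] AS canonical_origin
FROM (
  SELECT
    alexa.domain,
    crux.origin
  FROM
    `httparchive.urls.20170315` AS alexa
  JOIN (
    SELECT DISTINCT origin
    FROM `chrome-ux-report.all.201711`) AS crux
  ON NET.REG_DOMAIN(crux.origin) = alexa.domain)
GROUP BY
  domain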

@pmeenan (Member) commented Dec 21, 2017

Couldn't we use *.live.com from CrUX and cover all navigated domains instead of trying to limit it to 1? Granted, that may change the counts. Basically look for an exact match as well as *.
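In query form, the exact-match-plus-wildcard lookup could look something like this (a sketch, assuming the 20170315 table has the same rank/domain schema as 20171221; joining on the registrable domain is equivalent to matching the bare domain plus `*.domain`, since the 20170315 list has no subdomains):

#standardSQL
-- All CrUX origins whose host is an Alexa domain or any "*." subdomain of it.
SELECT
  alexa.rank,
  alexa.domain,
  crux.origin
FROM
  `httparchive.urls.20170315` AS alexa
JOIN (
  SELECT DISTINCT origin
  FROM `chrome-ux-report.all.201711`) AS crux
ON NET.REG_DOMAIN(crux.origin) = alexa.domain
ORDER BY
  alexa.rank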

@rviscomi (Member) commented Dec 21, 2017

To maintain the ranking integrity we'd need a 1:1 map of domain to origin. If we wanted to have unranked origins as well, I'd be ok with grabbing as many subdomains as we can accommodate.

@igrigorik (Contributor, Author)

By the sounds of it, their new list is reporting top requested origins, which is a change from top navigated origins, and is thus far less interesting or useful for us. The corollary here is that the new ranks are also of little value to us moving forward. With that in mind...

Long term, I don't see what value we get out of the Alexa list anymore: the intersection is small and we don't trust the ranks. As such, it seems that we can sunset our use of Alexa sometime in 2018.

In the short term, we don't have the capacity to crawl the full CrUX list, and we need some signal to help pick out the "high value" origins. For that, intersecting old Alexa (e.g. 20170315) with CrUX sgtm. Also, given that we don't trust the ranks moving forward, I'd suggest we stop surfacing them as well. As soon as we have enough capacity we can drop the requirement for the Alexa list and bootstrap from CrUX.

Does that map to what you guys are thinking?

@pmeenan (Member) commented Dec 21, 2017

sgtm

@rviscomi (Member) commented Jan 3, 2018

As soon as we have enough capacity we can drop the requirement for Alexa list and bootstrap from CrUX.

One thing that comes to mind here is that there would be two distinct shifts in the data: replacing the Alexa 500K with the Alexa+CrUX ~500K, then expanding to a pure CrUX ~1M. I wonder if it would be worth waiting for the capacity improvements and skipping over the Alexa+CrUX hybrid. These changes will have a big effect on the continuity of many if not all metrics, and it may be preferable to have one dramatic shift rather than two.

@igrigorik (Contributor, Author)

Do we have a good guesstimate for when we could do the "full" migration?

@pmeenan (Member) commented Jan 3, 2018

sgtm, particularly since we have a near-term plan for the capacity increase.

@pmeenan (Member) commented Jan 3, 2018

Probably a couple of months on the infrastructure side depending on the hardware order going through and migration of the existing server.
