Compare two images of different versions of a page #15

Open
anjackson opened this issue Apr 3, 2019 · 4 comments

@anjackson
Member

anjackson commented Apr 3, 2019

We would like to compare, e.g., the crawl-time or current live site with the archival playback version. One way to do this is to compare images of the two using standard image comparison techniques.

This could also be applied to same-site-over-time etc.

Related work:

@edipretoro
Contributor

Here are some details about what I've done so far.

I've been using SSIM (https://ieeexplore.ieee.org/document/1284395 and https://ece.uwaterloo.ca/~z70wang/publications/ssim.pdf). Here is the Python code used:

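#!/usr/bin/env python

import cv2
# compare_ssim was removed from skimage.measure in scikit-image 0.18;
# structural_similarity in skimage.metrics replaces it.
from skimage.metrics import structural_similarity

original_filename = './static/screenshots/fff6db9a99884cca50e01cbef58a4f33d797920f.jpeg'
archived_filename = './static/screenshots/fff6db9a99884cca50e01cbef58a4f33d797920f-archived.jpeg'

original = cv2.imread(original_filename)
archived = cv2.imread(archived_filename)

# cv2.imread returns None when a file cannot be read.
if original is not None and archived is not None:
    original_gray = cv2.cvtColor(original, cv2.COLOR_BGR2GRAY)
    archived_gray = cv2.cvtColor(archived, cv2.COLOR_BGR2GRAY)
    # Both images must have the same dimensions, otherwise this raises ValueError.
    (score, diff) = structural_similarity(original_gray, archived_gray, full=True)
    # SSIM is a similarity score: 1.0 means the two images are identical.
    print("SSIM between the two images is %f" % score)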
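
To make the score easier to interpret, here is a minimal sketch of the SSIM formula applied to the whole image as a single window. This is a simplification: the library slides a small window over the image and averages the local scores, so its numbers will differ. The function name `global_ssim` is hypothetical, and the sketch assumes 8-bit grayscale values passed as flat lists of equal length.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM over two flat lists of grayscale values (illustrative only)."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    # Stabilising constants from the SSIM paper (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((v - mean_x) * (v - mean_x) for v in x) / n
    var_y = sum((v - mean_y) * (v - mean_y) for v in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


pixels = [0, 50, 100, 150, 200, 250]
same = global_ssim(pixels, pixels)             # identical inputs score 1.0
flipped = global_ssim(pixels, pixels[::-1])    # inverted gradient scores well below 1.0
```

A score of 1.0 means the two inputs match exactly, and lower values mean less structural similarity, so the value is better read as a similarity than as a difference.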

Here is another attempt, this time with a perceptual hash:

#!/usr/bin/env python

from PIL import Image
import imagehash

original_filename = './data/fff6db9a99884cca50e01cbef58a4f33d797920f.jpeg'
archived_filename = './data/fff6db9a99884cca50e01cbef58a4f33d797920f-archived.jpeg'

hash_original = imagehash.phash(Image.open(original_filename))
hash_archived = imagehash.phash(Image.open(archived_filename))
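# Equal hashes mean the images look alike, not that the files are byte-identical.
if hash_archived == hash_original:
    print("Those two images have the same perceptual hash")
else:
    print("Those two images have different perceptual hashes")
# Subtracting two hashes gives their Hamming distance (number of differing bits).
print("Here is the difference: %d" % (hash_original - hash_archived))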
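
Since an exact hash match is a strict test, a threshold on the Hamming distance may be more useful in practice. The sketch below works on hashes stored as equal-length hex strings, which is an assumption about how they would be saved. The names `hamming_distance` and `looks_similar` and the default threshold of 10 bits are hypothetical and would need tuning on real screenshots.

```python
def hamming_distance(hex_a, hex_b):
    """Number of differing bits between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def looks_similar(hex_a, hex_b, max_distance=10):
    """True when the hashes differ by at most max_distance bits (assumed threshold)."""
    return hamming_distance(hex_a, hex_b) <= max_distance


# Two 64-bit hashes that differ only in their last bit.
distance = hamming_distance("ffd8a0c0e0f0f8fc", "ffd8a0c0e0f0f8fd")
similar = looks_similar("ffd8a0c0e0f0f8fc", "ffd8a0c0e0f0f8fd")
```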

And here is another algorithm, named VQI, used by Switzerland:

import cv2
from math import sqrt

def vqi_compute(original, archived, sample_size, url):
    # Load both screenshots as BGR arrays.
    original = cv2.imread(original)
    archived = cv2.imread(archived)
    original_means_regions = []
    archived_means_regions = []
    # Sample every other sample_size-wide block, starting from the second one.
    for x in range(1, int(original.shape[1]/sample_size), 2):
        k = sample_size * x
        l = sample_size * x + sample_size
        for y in range(1, int(original.shape[0]/sample_size), 2):
            i = sample_size * y
            j = sample_size * y + sample_size
            # Mean B, G and R values of the same block in each image.
            original_means_regions.append(cv2.mean(original[i:j,k:l])[:3])
            archived_means_regions.append(cv2.mean(archived[i:j,k:l])[:3])
    # Sum the Euclidean distances between the mean colours of matching blocks.
    total = 0
    for o, a in zip(original_means_regions, archived_means_regions):
        total = total + sqrt(pow(o[0] - a[0], 2) + pow(o[1] - a[1], 2) + pow(o[2] - a[2], 2))
    return (url, total)
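
To see which parts of the image the loops actually look at, the sketch below lists the sampled blocks for a given image size, using the same index arithmetic as the function above. The helper name `sampled_blocks` is hypothetical and it only computes coordinates, so it needs no image library.

```python
def sampled_blocks(width, height, sample_size):
    """Return (row_start, row_end, col_start, col_end) for each block the loops visit."""
    blocks = []
    for x in range(1, int(width / sample_size), 2):
        col_start = sample_size * x
        col_end = col_start + sample_size
        for y in range(1, int(height / sample_size), 2):
            row_start = sample_size * y
            row_end = row_start + sample_size
            blocks.append((row_start, row_end, col_start, col_end))
    return blocks


# A 40x40 image with 10-pixel blocks has 16 blocks in total.
blocks = sampled_blocks(40, 40, 10)
```

For this 40x40 case only 4 of the 16 blocks are sampled (those with odd column and odd row indices), so a change confined to the other blocks would not affect the result.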

For that one, I have just copied the function because I haven't finished the implementation yet.

Hope this helps!

@anjackson
Member Author

This Python implementation of e.g. pHash might be helpful: https://github.com/JohannesBuchner/imagehash

@ibnesayeed
Collaborator

While ImageMagick can be used for pixel-level diffs, https://github.com/myint/perceptualdiff is a great tool to measure perceptible differences.
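
For contrast, a pixel-level diff can be sketched in a few lines. This is not how ImageMagick or perceptualdiff work internally; it only illustrates the idea of counting pixels that differ. The function name `pixel_diff_ratio` and the `tolerance` parameter are hypothetical, and the images are assumed to be equally sized grids of grayscale values.

```python
def pixel_diff_ratio(image_a, image_b, tolerance=0):
    """Fraction of pixels whose grayscale values differ by more than tolerance."""
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in image_a)
    if total == 0:
        raise ValueError("images must not be empty")
    differing = sum(
        1
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return differing / total


# One of four pixels differs between these two 2x2 images.
ratio = pixel_diff_ratio([[0, 0], [0, 0]], [[0, 0], [0, 255]])
```

A measure like this reacts to every changed pixel equally, whether or not a viewer would notice it, which is the gap a perceptual tool is meant to close.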
