
[Newsletter] Problems with extracting the book title from a URL #65

Open
mkarpiarz opened this issue Apr 10, 2017 · 1 comment


As I mentioned in #47 (comment), the newsletter parser gets the book title from the URL behind the cover image.

urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']

This will work fine if the link on the landing page points to the main book page, as was the case here: https://www.packtpub.com/packt/free-ebook/amazon-web-services-free

<a href="/networking-and-servers/mastering-aws-development">
  <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering AWS Development.jpg" class="bookimage" />
</a>

but will yield unexpected results when this href points to something else, for example a cover image, as it does here: https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2

<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
  <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
</a>

The latter will result in

title = urlWithTitle.split('/')[-1].replace('-', ' ').title()
becoming '5612_Wyntkangular_Ebook_500X617.Jpg' instead of the correct title. A wrong title also breaks the filename under which the book is written to disk, making it '5612_Wyntkangular_Ebook_500X617.Jpg.{pdf,mobi,epub}'.
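
For reference, the difference between the two cases can be reproduced outside the parser. This is only a minimal sketch: `title_from_url` is a hypothetical wrapper name, and the two hrefs are the ones quoted above.

```python
def title_from_url(url_with_title):
    # Same expression the newsletter parser applies to the href
    # (wrapped in a hypothetical helper for illustration).
    return url_with_title.split('/')[-1].replace('-', ' ').title()


# href pointing at the main book page: the last path segment is the title slug
book_page = '/networking-and-servers/mastering-aws-development'
print(title_from_url(book_page))    # Mastering Aws Development

# href pointing at the cover image: the last path segment is the image filename
cover_image = ('///d1ldz4te4covpm.cloudfront.net/sites/default/files/'
               'imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg')
print(title_from_url(cover_image))  # 5612_Wyntkangular_Ebook_500X617.Jpg
```

The second result comes from `str.title()` treating every run of letters as a separate word, so the `x` in `500x617` and the `jpg` extension get capitalised as well.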

An alternative would be to use the string inside the h1 tag of the title-bar-title div, as in mkarpiarz@c583d37.
But this doesn't always seem to be reliable either, e.g.:

<div id="title-bar-title"><h1>Free Amazon Web Services eBook</h1></div>

niqdev (Owner) commented Apr 11, 2017

I would suggest going for the h1 tag and, if it is missing for some reason, using the other approach as a fallback, maybe stripping the numbers with a regexp. The output probably won't be nice, but at least it should work.
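
A rough sketch of that suggestion, not the project's actual code: `pick_title` and `title_from_href` are hypothetical names, the h1 text is assumed to have been extracted already (and to be `None` or empty when the tag is missing), and the regexps are just one possible way to strip the numbers.

```python
import re


def title_from_href(href):
    """Fallback: derive a title from the last path segment of the href."""
    slug = href.rstrip('/').split('/')[-1]
    slug = re.sub(r'\.\w+$', '', slug)            # drop a file extension such as .jpg
    slug = re.sub(r'\d+(?:x\d+)?', ' ', slug)     # drop numbers and sizes like 500x617
    slug = re.sub(r'[-_\s]+', ' ', slug).strip()  # collapse separators into single spaces
    return slug.title()


def pick_title(h1_text, href):
    """Prefer the h1 text; fall back to the href when it is missing or blank."""
    if h1_text and h1_text.strip():
        return h1_text.strip()
    return title_from_href(href)


image_href = ('///d1ldz4te4covpm.cloudfront.net/sites/default/files/'
              'imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg')

print(pick_title('Free Amazon Web Services eBook', image_href))  # h1 wins
print(pick_title(None, image_href))                              # Wyntkangular Ebook
```

As expected, the fallback output is not pretty, and stripping digits would also drop meaningful numbers such as a version in the title. The h1 branch still inherits the problem described above, where the heading is not the real book title.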
