Error attempting to claim book from newsletter #47

Open
lucymhdavies opened this issue Jan 31, 2017 · 39 comments

@lucymhdavies

~ $ python script/spider.py --config config/prod.cfg --notify ifttt --claimOnly

                      __   __              __                                   __
    ____  ____ ______/ /__/ /_____  __  __/ /_        ______________ __      __/ /__  _____
   / __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
  / /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ /  / /_/ /| |/ |/ / /  __/ /
 / .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/      \___/_/   \__,_/ |__/|__/_/\___/_/
/_/                       /_/

Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler

[*] 2017-01-31 10:30 - fetching today's eBooks
[*] configuration file: /app/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[+] notification sent to IFTTT
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
  File "script/spider.py", line 123, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/app/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/app/script/packtpub.py", line 98, in __parseNewsletterBookInfo
    title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[+] error notification sent to IFTTT
[*] done
~ $

The script successfully claimed the newsletter book the first time, but on subsequent days I'm getting the above error.

And it sends an IFTTT error notification for the failed claim :(

@lucymhdavies (Author)

Updated to the latest code from master of this repo. Issue still present.

~ $ python script/spider.py --config config/prod.cfg --claimOnly

                      __   __              __                                   __
    ____  ____ ______/ /__/ /_____  __  __/ /_        ______________ __      __/ /__  _____
   / __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
  / /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ /  / /_/ /| |/ |/ / /  __/ /
 / .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/      \___/_/   \__,_/ |__/|__/_/\___/_/
/_/                       /_/

Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler

[*] 2017-01-31 10:36 - fetching today's eBooks
[*] configuration file: /app/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
  File "script/spider.py", line 123, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/app/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/app/script/packtpub.py", line 98, in __parseNewsletterBookInfo
    title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[*] done

@juzim (Contributor) commented Jan 31, 2017

Can confirm, I'll look into it. As a quick fix, create a file named "lastNewsletterUrl" in the config folder containing "https://www.packtpub.com/packt/free-ebook/practical-data-analysis". This should stop the errors for now.
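
For example (a minimal sketch; it assumes you run it from the repository root):

# The parser does title = urlWithTitle.split('/')[4]..., which raises
# IndexError whenever the scraped href has fewer than five '/'-separated
# segments. Pre-seeding the marker file makes the crawler skip the
# already-claimed newsletter book, and with it the broken parse.
with open('config/lastNewsletterUrl', 'w') as f:
    f.write('https://www.packtpub.com/packt/free-ebook/practical-data-analysis')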

@juzim (Contributor) commented Jan 31, 2017

Fixed in #48

@lucymhdavies (Author)

nice. thanks 👍

@trancen commented Jan 31, 2017

Getting this error now, after updating to version 2.2.4:

[+] new download: /home/david/packtpub-crawler//home/david/packtpub-crawler/ebooks/extras/MySQL_for_Python.zip
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
  File "script/spider.py", line 123, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/home/david/packtpub-crawler/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/home/david/packtpub-crawler/script/packtpub.py", line 98, in __parseNewsletterBookInfo
    title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[*] done

@juzim (Contributor) commented Jan 31, 2017

Yes, this is the same error. Hopefully the fix will be merged tonight, or you can change the line yourself.

@trancen commented Jan 31, 2017

Sorry, I just noticed the comment above about creating the "lastNewsletterUrl" file. That worked.

@niqdev (Owner) commented Jan 31, 2017

Thanks @juzim, the problem above is fixed, but there is actually still a bug; see the log below (I've hidden some variables/paths):

  • the newsletter eBook is uploaded twice, under two different names (but the file is the same)
  • I received two emails: the first one is correct, but the second one again contains two links to two PDFs that are the same
python script/spider.py -c config/prod.cfg -u drive -s firebase -n gmail


                      __   __              __                                   __
    ____  ____ ______/ /__/ /_____  __  __/ /_        ______________ __      __/ /__  _____
   / __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
  / /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ /  / /_/ /| |/ |/ / /  __/ /
 / .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/      \___/_/   \__,_/ |__/|__/_/\___/_/
/_/                       /_/

Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler
        
[*] 2017-01-31 18:45 - fetching today's eBooks
[*] configuration file: XXX/github/packtpub-crawler/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[################################] 9985/9985 - 00:00:01
[+] new download: XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
[+] new file upload on Drive:
[+] uploading file...
-[+] updating file permissions...
-       [path] XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
	[download_url] UUU
	[name] MySQL_for_Python.pdf
	[mime_type] application/pdf
	[id] III
[+] Stored on firebase: KKK
[+] notified to: ['aaa', 'bbb']
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[################################] 9985/9985 - 00:00:01
[+] new download: XXX/packtpub-crawler/ebooks/Practical_Data_Analysis.pdf
[+] new file upload on Drive:
[+] uploading file...
|[+] updating file permissions...
-       [path] XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf
	[download_url] https://drive.google.com/uc?id=ZZZ
	[name] MySQL_for_Python.pdf
	[mime_type] application/pdf
	[id] LLL
[+] new file upload on Drive:
[+] uploading file...
|[+] updating file permissions...
\       [path] XXX/packtpub-crawler/ebooks/Practical_Data_Analysis.pdf
	[download_url] https://drive.google.com/uc?id=DDD
	[name] Practical_Data_Analysis.pdf
	[mime_type] application/pdf
	[id] YYY
[+] Stored on firebase: WWW
[+] notified to: ['aaa', 'bbb']
[*] done

@niqdev (Owner) commented Jan 31, 2017

Actually, I started the script again and it seems the 3 books are identical: the daily eBook is ignored and I have 3 copies of the same (newsletter) book. Moreover, I checked on Firebase and the uploaded data is mixed up, e.g. a different filename but the same author (the newsletter book's).

@juzim (Contributor) commented Jan 31, 2017

I think we have to reset the Packtpub instance before handling the newsletter, but then this would have been broken for weeks. Could you check?

Also, could you comment out spider.py:22 so we can see the contents of each handled packtpub.info?

@niqdev (Owner) commented Jan 31, 2017

Yep. Here's the first log (note that the author is wrong), and I just noticed that paths is empty:

...
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
{
  "url_source_code": "https://www.packtpub.com/code_download/12891", 
  "paths": [], 
  "description": "MySQL for Python is the essential ingredient for building productive and feature-rich Python applications as it provides powerful database support and will also take the burden off your webserver. This eBook shows how to boost the productivity and maintainability of your Python apps by integrating them with the MySQL database server. It will take you from installing MySQL for Python on your platform of choice all the way through to database automation and administration. Every chapter is illustrated with a practical project that you can use during your own app development process. This eBook is free for today only so don\u2019t miss out!", 
  "title": "MySQL for Python", 
  "author": "Hector Cuesta", 
  "filename": "MySQL_for_Python", 
  "book_id": "12890", 
  "url_claim": "https://www.packtpub.com/freelearning-claim/5286/21478", 
  "url_image": "https://dz13w8afd47il.cloudfront.net/sites/default/files/imagecache/dotd_main_image/0189OS_MockupCover_0.jpg"
}
[+] book successfully claimed
...

And here's the second log:

...
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
{
  "url_source_code": "https://www.packtpub.com/code_download/12891", 
  "paths": [
    "XXX/packtpub-crawler/ebooks/MySQL_for_Python.pdf"
  ], 
  "description": "Get started in data analysis with this free 360 page eBook guide\nFor small businesses, analyzing the information contained in their data using open source technology could be game-changing. All you need is some basic programming and mathematical skills to do just that. This free data analysis eBook is designed to give you the knowledge you need to start succeeding in data analysis. Discover the tools, techniques and algorithms you need to transform your data into insight.\n\nVisualize your data to find trends and correlations\nBuild your own image similarity search engine\nLearn how to forecast numerical values from time series data\nCreate an interactive visualization for your social media graph", 
  "title": "Practical Data Analysis", 
  "author": "Hector Cuesta", 
  "filename": "Practical_Data_Analysis", 
  "book_id": "12890", 
  "url_claim": "https://www.packtpub.com/promo/claim/12891/27564", 
  "url_image": "https://d1ldz4te4covpm.cloudfront.net/sites/default/files/B02731_Practical Data Analysis.jpg"
}
[+] book successfully claimed
...

@niqdev (Owner) commented Jan 31, 2017

About the first question: we probably have to reset everything before a new claim, but is this the first newsletter since they reactivated the free eBook? The other books seem correct; I also checked the logs on Heroku.

@niqdev (Owner) commented Jan 31, 2017

Also, the url_source_code field is wrong.

@juzim (Contributor) commented Jan 31, 2017

Can you add packtpub = Packtpub(config, args.dev) to spider.py:123 and see if it fixes it? Sorry, but somehow the tests on my machine are broken...

@juzim (Contributor) commented Jan 31, 2017

Did you by any chance delete the lastNewsletterUrl file? If the script tries to grab an already-claimed newsletter book from the archive, it won't find it in the top position and will overwrite the data with whatever book is currently there.

We have to either check whether the book was already claimed (which would be a nice feature anyway) or find the book in the list by name/id/etc.
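
A lookup by title might look roughly like this (a sketch only; the div.product-line selector and the title attribute are assumptions about the my-ebooks markup, not verified against the live site):

# sketch: locate an already-claimed book in the my-ebooks list by title
# instead of assuming it is always the first entry
def find_book_node(soup, title):
    for node in soup.select('div.product-line'):     # assumed markup
        if node.get('title', '').startswith(title):
            return node
    return None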

I'll try to look into it tomorrow but it shouldn't happen again unless you delete the file.

@niqdev (Owner) commented Jan 31, 2017

I'm testing now, using Docker, with the change that you suggested, i.e. adding at line 123:

packtpub = Packtpub(config, args.dev)
packtpub.runNewsletter(currentNewsletterUrl)

I get the following error:

...
[+] book successfully claimed
[+] created new directory: /packtpub-crawler/ebooks
[-] downloading file from url: https://www.packtpub.com/ebook_download/12890/pdf
[+] new download: /packtpub-crawler/ebooks/MySQL_for_Python.pdf
[+] new file upload on Drive:
\[+] uploading file...
\[+] updating file permissions...
/Traceback (most recent call last):
  File "script/spider.py", line 124, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/packtpub-crawler/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/packtpub-crawler/script/packtpub.py", line 105, in __parseNewsletterBookInfo
    self.info['url_claim'] = self.__url_base + claimNode[0]['href']
IndexError: list index out of range
	[path] /packtpub-crawler/ebooks/MySQL_for_Python.pdf
...

@juzim (Contributor) commented Jan 31, 2017

When we reset the whole Packtpub instance we also lose the login information, so that won't work. I added a method to reset packtpub.info, but that won't solve the other issue.

@niqdev (Owner) commented Jan 31, 2017

We should probably just reset it in Packtpub.py:

self.info = {
  'paths': []
}

What do you think?

@niqdev (Owner) commented Jan 31, 2017

No, same problem: we lose the session.

@juzim (Contributor) commented Jan 31, 2017

I think the solution is to check whether the book is already claimed before processing it further. The claim response page contains an error message that we can parse. I'll try to submit a patch tomorrow.
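
Roughly like this (a minimal sketch; the div.error selector and the message wording are assumptions about Packtpub's markup, not the actual patch):

# sketch: detect the "already claimed" message on the claim response page
def already_claimed(soup):
    for node in soup.select('div.error'):            # assumed selector
        if 'already' in node.get_text().lower():
            return True
    return False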

@niqdev (Owner) commented Feb 1, 2017

About your solution, I just don't like that we have to do another request; we should be able to reset all the fields beforehand.
This is just my thought, but this is where mutable state sucks (we are also missing tests) and a purely functional approach would help us a lot. By the way, I'm not going to rewrite anything... haha

@juzim (Contributor) commented Feb 1, 2017 via email

@juzim (Contributor) commented Feb 1, 2017

While I'm not a fan of flow control by exceptions, I think it works just fine in this case. Your code is solid and can surely take quite a bit more beating before a rewrite becomes necessary.

Fixed in #50

This fix looks for a specific error message on the claim result page (which, curiously, only exists for the newsletter, not the dailies) and should work for now. No additional requests are made.

Assuming that the first entry in the archive is always the book we are currently processing might cause further trouble (Packtpub might switch to alphabetical sorting, for example). But right now the only way to see whether the book was already claimed is searching the archive by name, which is error-prone since we parse the title from the claim URL.
In my opinion, the right way to do this is not parsing the latest entry, but synchronizing the archive with the local downloads folder (stepping through all books on the page and downloading those that are missing) after a claim. Since we can then generate the file name from the list entry title instead of the claim URL, we can match them reliably.
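
In Python, the idea would be something like this (a sketch only; archive_books and download(book) are hypothetical stand-ins, not the crawler's actual API):

import os

# sketch: after a claim, download every archive entry missing locally
def sync_archive(archive_books, ebooks_dir, download):
    local = set(os.listdir(ebooks_dir))
    for book in archive_books:                  # entries parsed from my-ebooks
        filename = book['title'].replace(' ', '_') + '.pdf'
        if filename not in local:
            download(book)                      # fetch only what's missing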

This would also resolve #23

Any volunteers? :)

@lucymhdavies (Author)

In my opinion, the right way to do this is not parsing the latest entry, but synchronizing the archive with the local downloads folder (stepping through all books on the page and downloading those that are missing) after a claim.

How would that work if you're running with --claimOnly?

@juzim (Contributor) commented Feb 1, 2017

The script would just claim the book, and you could download it later manually or run it with a "downloadAll" parameter that only syncs the archive with the local folder. Notifications etc. are handled on claim, not on download.
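
For illustration, such a flag might look like this (hypothetical; it assumes spider.py's existing argparse parser and is not in the codebase):

# hypothetical flag: sync the archive without claiming anything new
parser.add_argument('--downloadAll', action='store_true',
                    help='download every archived eBook missing from the local folder')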

@niqdev (Owner) commented Feb 1, 2017

@juzim, I'll keep monitoring until the next newsletter. I'll create tag 2.2.5.
Your solution with the local search is fine. As for me, as you have seen, I haven't had much time in this period and I'm working on other projects as well. If you leave it there I may do it, I just don't know when.
Thanks

@niqdev (Owner) commented Apr 2, 2017

Just to keep track: this needs further investigation.
Sometimes the script is able to download the newsletter, but this week, for example, there is this error:

[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@125
Traceback (most recent call last):
  File "script/spider.py", line 125, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "PATH/packtpub-crawler/script/packtpub.py", line 169, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "PATH/packtpub-crawler/script/packtpub.py", line 101, in __parseNewsletterBookInfo
    urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']
IndexError: list index out of range

@juzim (Contributor) commented Apr 2, 2017 via email

@niqdev (Owner) commented Apr 2, 2017

No, unfortunately the claiming is not working either.

@niqdev (Owner) commented Apr 2, 2017

The promo-landing-book-picture div doesn't exist.

@juzim (Contributor) commented Apr 2, 2017 via email

@mkarpiarz

Looks like some of the divs have been renamed on the newsletter's landing page. I compared the page for an older book:

    <div class="book-top-block-wrapper cf">
        <div class="cf section-inner">
            <div class="float-left promo-landing-book-picture">
                <div itemprop="image" itemtype="http://schema.org/URL" itemscope>
                    <a href="/web/20170113204509/https://dz13w8afd47il.cloudfront.net/networking-and-servers/mastering-aws-development">
                        <img src="/web/20170113204509im_/https://d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering%20AWS%20Development.jpg" class="bookimage" />
                    </a>
                </div>
            <div class="float-left promo-landing-book-info">
                <div class="promo-landing-book-body-title">
                                    </div>
                <div class="promo-landing-book-body">
                    <div><h1>Claim your free 416 page Amazon Web Services eBook!</h1>
<p>This book is a practical guide to developing, administering, and managing applications and infrastructures with AWS. With this, you'll be able to create, design, and manage an entire application life cycle on AWS by using the AWS SDKs, APIs, and the AWS Management Console.</p>
</div>
                </div>
                            </div>

with the current one:

<div id="main-book" class="cf nano" itemscope itemtype="http://schema.org/Book">
    <div class="book-top-block-wrapper cf">
        <div class="cf section-inner">
            <div class="float-left nano-book-main-image">
                <div itemprop="image" itemtype="http://schema.org/URL" itemscope>
                    <a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
                        <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
                    </a>
                </div>
            <div class="float-left nano-book-text">
                <h1>What you need to know about Angular 2</h1>
                <div><strong>Get to grips with the ins and outs of one of the biggest web dev revolutions of this decade with the aid of this free eGuide! From setting up the very basics of Angular to making the most of Directives and Components you’ll discover everything you need to get started building your own web apps today.</strong></div>
                <div id="nano-learn">
                    <div id="nano-learn-title">
                        <div id="nano-learn-title-text">
                            <span id="nano-learn-title-text-inner">
                                What You Will Learn                            </span>
                        </div>
                    </div>

and came up with this hotfix:
master...mkarpiarz:fix_newsletter_divs
I haven't tested email notifications yet, so I'm not sure how the description would look, but claiming a newsletter eBook seems to work now. Happy to submit a PR if @juzim hasn't started working on this yet.
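
The gist of such a fix could be a fallback across both class names (my illustration of the rename shown above, not necessarily the branch's actual diff):

# sketch: try the new class first, then fall back to the old one
nodes = div_target.select('div.nano-book-main-image a') \
    or div_target.select('div.promo-landing-book-picture a')
urlWithTitle = nodes[0]['href']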

@juzim (Contributor) commented Apr 9, 2017

That would be great, thanks!

@CrazySerGo commented Apr 9, 2017

Hi guys,
I'm creating a Google script that parses Packtpub tweets (it's based on @juzim's Google script). I'm not sure, but there is a chance that all books from newsletters will also be published on their Twitter, so there's no need to fix it :) joking.
It's not finished yet: it should exclude duplicates and check whether each link is still available. If you have time, please look at the output and see whether it works for the crawler: https://goo.gl/AXtAC8

@juzim (Contributor) commented Apr 9, 2017

The link doesn't work for me, can you create a pull request please?

Also, while there are tons of free books on the feed, they repeat a lot, so we have to make sure the duplication check works.

@mkarpiarz

Is there a reason for the newsletter spreadsheet being empty even though this week's free ebook is still available under https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2?

@niqdev (Owner) commented Apr 9, 2017

@mkarpiarz, before merging the PR can you please confirm that the email notifications are still working? Thanks

@juzim (Contributor) commented Apr 9, 2017

@mkarpiarz I removed it to prevent error messages until the issue is fixed

@mkarpiarz

@juzim - that's fine for now, since there is an option to self-host the file.

I haven't tested email notifications yet, @niqdev, but I printed out all the variables in the __parseNewsletterBookInfo method and noticed this in the output:

self.info['title']: u'5612_Wyntkangular_Ebook_500X617.Jpg'
self.info['filename']: '5612_Wyntkangular_Ebook_500X617.Jpg'

This is because this line:

urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']

extracts the book title from the URL inside the element that holds the book cover, and for this week's free eBook the relevant part looks like this:

<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
  <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
</a>

So there is no link containing the book title; instead, the URL points to the cover image.
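
One possible follow-up fix (a sketch; div.nano-book-text h1 matches the markup pasted above, everything else is my assumption):

# sketch: take the title from the visible <h1> instead of the image href
title_nodes = div_target.select('div.nano-book-text h1')
if title_nodes:
    self.info['title'] = title_nodes[0].get_text().strip()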

I'll create a separate thread for this title parsing issue.
