You are viewing a single comment's thread from:

RE: Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1

in #utopian-io · 7 years ago

Great post! I didn't know that BeautifulSoup accepts jQuery-like CSS selectors :)
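
For example, a minimal sketch of those CSS selectors via BS4's .select() method (the URL and selector below are just placeholders):

```python
# A minimal sketch of BS4's CSS-selector support via .select();
# the URL and selector are just illustrative placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com').text
soup = BeautifulSoup(html, 'html.parser')

# .select() takes a CSS selector string, much like jQuery's $('a[href^="http"]')
for link in soup.select('a[href^="http"]'):
    print(link.get('href'))
```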

It would be very interesting to see a couple of tutorials about the Scrapy module :) redoing the same given exercises using Scrapy (Scrapy is very convenient for scraping and, above all, for crawling).
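
Something like this minimal Scrapy spider sketch (using the public quotes.toscrape.com practice site, and assuming a recent Scrapy version):

```python
# A minimal Scrapy spider sketch; quotes.toscrape.com is a public
# scraping practice site, and the selectors match its markup.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # extract one field per quote block on the page
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}
        # follow the pagination link; Scrapy schedules the request for us
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it can be run with `scrapy runspider quotes_spider.py -o quotes.json`, and Scrapy takes care of request scheduling, deduplication and throttling.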

Also, another one on scraping "dynamic" sites that are modified at runtime by JavaScript (Selenium could be used for that). There is a Python module called pyvirtualdisplay which uses Xvfb to run an X session in the background; it is useful, for example, for running Selenium as a headless browser :)
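
A rough sketch of that combination, assuming a Linux machine with Firefox and geckodriver installed:

```python
# A rough sketch: run a background Xvfb X session so a "real" browser
# behaves as if headless; assumes Linux + Firefox + geckodriver.
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1024, 768))
display.start()                      # Xvfb now runs in the background

driver = webdriver.Firefox()         # the browser renders into the virtual display
driver.get('https://example.com')    # JavaScript executes as usual
print(driver.title)                  # the DOM includes runtime-modified content

driver.quit()
display.stop()
```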

Great posts!

Thanks for your in-depth reply! I know Scrapy, but I don't see much real added value in it compared to simply using BS4 + Requests.
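
As a rough sketch of what I mean, a hand-rolled crawl loop with just Requests + BS4 covers the same ground (the seed URL and the page limit here are arbitrary placeholders):

```python
# A hand-rolled crawl loop with Requests + BS4; the seed URL and
# the page limit are arbitrary placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

to_visit = ['https://example.com']
visited = set()

while to_visit and len(visited) < 20:
    url = to_visit.pop()
    if url in visited:
        continue
    visited.add(url)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # collect absolute versions of every link found on the page
    for a in soup.select('a[href]'):
        to_visit.append(urljoin(url, a['href']))
```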

Selenium is an option; having a nodeJS subprocess run nightmareJS instead is another. The pyvirtualdisplay + Xvfb option to "give a head" to a headless browser is indeed possible too. But what's the core purpose of a headless browser for "client-side event automation"? Automating things without having to do them manually, and/or rendering the DOM. And there's nothing wrong with a few print() statements while developing / debugging! Works just fine! ;-)
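
For illustration, the nodeJS subprocess idea could look roughly like this from the Python side (scrape.js is a hypothetical Node script that would use nightmareJS and print its result to stdout):

```python
# A sketch of the nodeJS-subprocess approach from the Python side;
# scrape.js is a hypothetical Node script (e.g. using nightmareJS)
# that prints its scraped result to stdout.
import subprocess

result = subprocess.run(
    ['node', 'scrape.js', 'https://example.com'],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # whatever the Node script printed
```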

PS1: This was just part 1 of the web crawler mini-series. More will come very soon; be sure to follow along!
PS2: Since I'm treating my entire Learn Python Series as an interactive Python book in the making (publishing episodes as tutorial parts via the Steem blockchain as we go), I must consider the sorted order of subjects already discussed very carefully. Therefore I might not use technical mechanisms I would normally use in a real-life software development situation. For example, I haven't yet explained using a MongoDB instance and interfacing with it via pymongo, which would of course be my go-to tool when developing a real-life web crawler. Instead, because I just explained "handling files" (earlier this week, 2 episodes ago), I will, for now in part 2 of the web crawler mini-series, just use plain .txt files for intermediate data storage (since I haven't discussed CSV or JSON yet either).
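
In that spirit, the intermediate storage will look roughly like this (the file name and URLs are just placeholders):

```python
# A minimal sketch of plain .txt intermediate storage; the file name
# and URLs are placeholders.
urls_found = ['https://example.com/page1', 'https://example.com/page2']

# append newly discovered URLs, one per line
with open('crawled_urls.txt', 'a') as f:
    for url in urls_found:
        f.write(url + '\n')

# later, read them back for the next crawl pass
with open('crawled_urls.txt') as f:
    stored_urls = [line.strip() for line in f]
```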

See you around!
@scipio

I understand :)

I didn't know nightmareJS; it seems to have pretty nice tools. I'm going to give it a try.

Thanks!

@zerocoolrocker