RE: Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1

in #utopian-io · 7 years ago

Thanks for your in-depth reply! I know Scrapy, but I don't see much added value in using it over plain BS4 + Requests.
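
Just to make concrete what I mean by the BS4 + Requests approach, here's a minimal sketch (the URL and the extracted attribute are placeholders, not taken from the tutorial):

```python
# Minimal BS4 + Requests sketch; the URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Print every hyperlink found on the page
for link in soup.find_all("a"):
    print(link.get("href"))
```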

Selenium is one option; running a nodeJS subprocess with nightmareJS instead is another. Using pyvirtualdisplay + xvfb to "give a head" to a headless browser is indeed possible, but what's the core purpose of a headless browser for "clientside event automation"? Automating things without having to do them manually, and/or rendering the DOM. Nothing wrong with a few print() statements while developing / debugging! Works just fine! ;-)
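
For reference, "giving a head" to a headless setup with pyvirtualdisplay (a wrapper around Xvfb) roughly looks like the sketch below; it assumes Xvfb, Firefox and geckodriver are installed, and the URL is a placeholder:

```python
# Rough sketch: Selenium driving Firefox inside an Xvfb-backed virtual display.
# Assumes Xvfb, Firefox and geckodriver are installed; the URL is a placeholder.
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1024, 768))  # the virtual "head"
display.start()

driver = webdriver.Firefox()
try:
    driver.get("https://example.com")
    print(driver.title)  # a quick print() while debugging
finally:
    driver.quit()
    display.stop()
```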

PS1: This was just part 1 of the web crawler mini-series. More will come very soon, be sure to follow along!
PS2: Since I'm treating my entire Learn Python Series as an interactive Python book in the making (publishing episodes as tutorial parts via the Steem blockchain as we go), I have to consider the order of subjects already discussed very carefully. Therefore I might not use technical mechanisms I would normally use in a real-life software development situation. For example, I haven't yet explained using a mongoDB instance and interfacing with it via pymongo, which of course would be my go-to tool when developing a real-life web crawler. Instead, because I explained "handling files" just earlier this week (two episodes ago), I will, for now in part 2 of the web crawler mini-series, just use plain .txt files for intermediate data storage (since I haven't discussed CSV or JSON either).
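
As a rough idea of what I mean by plain .txt intermediate storage (the filename and data below are placeholders, not the actual part 2 code):

```python
# Placeholder sketch of storing intermediate crawl results in a plain .txt file.
crawled_urls = ["https://example.com/page-1", "https://example.com/page-2"]

# Append, so results from repeated crawl runs accumulate
with open("crawled_urls.txt", "a") as f:
    for url in crawled_urls:
        f.write(url + "\n")

# Read them back in later
with open("crawled_urls.txt") as f:
    stored = [line.strip() for line in f]
print(stored)
```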

See you around!
@scipio

I understand :)

I didn't know nightmareJS; it seems to have pretty nice tools. I'm going to give it a try.

Thanks!

@zerocoolrocker