RE: Tutorial: Exploring raster and vector geographic data with rasterio and geopandas

in #utopian-io, 7 years ago (edited)

If I'm not mistaken (it's been a long time since I touched Python :D), almost every library derived from pandas loads data into memory. For example, if I load data from a PostGIS database using geopandas, the whole result is loaded from the database into memory, which can make a computer with limited RAM hang. Any idea how to prevent this?

Also, if it's not a burden, would you share the memory usage and how long it takes to display the map? (A rough estimate is okay.)

And a tip to make your next tutorial more readable: try zooming in (CTRL + mouse scroll) on your Jupyter notebook before taking a screenshot :)

Hi @drsensor. Thanks for commenting. Yes, this tutorial handles data in memory. My computer has 8 GB of RAM and the files I used are less than 1 GB in total (I think). I can try to get you the actual amount of memory being used when I get a chance. If you have to use much more data than this, then you either need more RAM (there are cloud servers with lots of RAM that you can rent for a few hours and a few dollars) or you need to optimize all this processing. Optimizing raw data analysis is a whole subject of its own. You could use Cython and map-reduce algorithms.
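To make the map-reduce idea a bit more concrete, here is a minimal sketch of reading a query result in chunks and reducing each chunk to a partial result, so the full table never sits in memory at once. It uses the standard-library `sqlite3` module as a stand-in for PostGIS, and the `parcels` table, its `area` column, and the helper names are all made up for illustration. Recent geopandas versions document a `chunksize` argument on `read_postgis` that follows the same pattern, but check the version you have installed.

```python
import sqlite3


def build_demo_db(n_rows=10_000):
    """Create an in-memory table of hypothetical parcel areas (stand-in for PostGIS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parcels (id INTEGER PRIMARY KEY, area REAL)")
    conn.executemany(
        "INSERT INTO parcels (area) VALUES (?)",
        ((float(i),) for i in range(n_rows)),
    )
    conn.commit()
    return conn


def chunked_rows(conn, query, chunk_size):
    """Yield lists of at most chunk_size rows instead of fetching everything."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


def total_area(conn, chunk_size=1000):
    """Map each chunk to a partial sum, then reduce the partial sums."""
    partial_sums = (
        sum(area for (area,) in rows)
        for rows in chunked_rows(conn, "SELECT area FROM parcels", chunk_size)
    )
    return sum(partial_sums)


conn = build_demo_db(100)
print(total_area(conn, chunk_size=7))
```

Only one chunk of rows is alive at a time, so peak memory depends on `chunk_size` rather than on the size of the table.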

Next time I will try to make the images compatible with small devices. Thanks for the suggestion!

Nice, thank you for your time, looking forward to that 😊.
I see, so it seems it will take ~800 MB of RAM. Be careful not to accidentally re-run the notebook (.ipynb) more than 7 times: if each run keeps its data in memory, an 8th run brings it to about 800 MB x 8 = 6.4 GB, which is close to your 8 GB. If you want to re-run it, make sure to exit the notebook first to free the memory.
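Rather than estimating, peak memory and elapsed time can be measured with the standard library. This is only a rough sketch: `measure` and `load_fake_raster` are hypothetical names, the fake loader stands in for the real raster loading step, and `tracemalloc` only sees allocations made through Python's allocator, so memory used internally by C libraries such as GDAL may not be counted.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def load_fake_raster(n_cells):
    """Hypothetical stand-in for loading a raster into memory."""
    return [0.0] * n_cells


data, seconds, peak_bytes = measure(load_fake_raster, 1_000_000)
print(f"{seconds:.3f} s, peak {peak_bytes / 1e6:.1f} MB")
```

Wrapping the cell that loads the data and the cell that draws the map in `measure` would give one number for each step.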

Actually, long ago I did an image-processing project, and according to my notes I used something like lazy evaluation, which only loads and computes specific parts when they are needed. There were several libraries I wanted to use back then, but I only ended up using one of them (time-constrained project 😂). Maybe you want to experiment with one of these libraries (if you hit memory problems or have a cluster of computers):

  • Blaze: I was interested in trying this but never had the chance. You can do batching, set the chunk size, and then do something like external-memory (out-of-core) computation (maybe).
  • Spark/PySpark: At the time I wasn't really interested in using this because of the laborious installation process for a single laptop, and it didn't seem to suit my use case.
  • Dask: I used this. It really suited my use case because some parts of the algorithm could run on the GPU, so I could distribute the computation between CPU and GPU roughly evenly (though not perfectly). The downside is that it really takes time to build a good DAG. According to my notes, I used the delayed function a lot when loading images and doing pre-processing tasks.
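The lazy-evaluation idea can be sketched in plain Python without any of these libraries. The `Lazy` class below is a toy stand-in for what a delayed-style wrapper does (it is not Dask's API), and `load_image` and `normalize` are hypothetical placeholders for real loading and pre-processing steps.

```python
class Lazy:
    """Record a function call without running it until compute() is called."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any lazy arguments first, then run this step.
        args = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*args)


load_calls = []  # records which images were actually loaded


def load_image(index):
    """Hypothetical loader: pretend each image is four pixels."""
    load_calls.append(index)
    return [index] * 4


def normalize(pixels):
    """Scale pixel values so the largest becomes 1.0."""
    top = max(pixels) or 1
    return [p / top for p in pixels]


# Building 1000 tasks loads nothing yet; it only describes the work.
tasks = [Lazy(normalize, Lazy(load_image, i)) for i in range(1000)]

# Only the requested image is loaded and processed.
first = tasks[3].compute()
print(first, load_calls)
```

The task list is cheap to build because each entry is just a description of work; memory is spent only on the items that are actually computed.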

Thanks for the great links. I didn't know a couple of these. I have not had the opportunity to handle data this big, but hopefully I will, and these options will come in handy. Looks like you know quite a bit about computationally intensive analysis. Hope to see some tutorials from you about this ;)