# share-your-work
As promised in my intro, here’s a little bit of current thinking I wrote up about what feels like it’s going to end up as a DSL to let ecologists work on large datasets. Currently it’s a Python library that lets ecologists work with large geospatial files as if they’re variables, much as numpy does, but manages the memory side of things, as these files can quickly cause you to run out of memory even on a 1TB RAM machine like the one we have in the group. The next thing I have in the wings is trying to hide Python multiprocessing support behind the library, for two reasons:

• I’m already chunking up the computation, so throwing it out to many cores for concurrency seems a natural fit. Early in my time here I tried to encourage the ecologists to think in terms of small programs that would then be run concurrently, but that didn’t really work with how they think about programming, so I think letting them think single-threaded whilst I hide the concurrency is probably easiest.

• GDAL, the library used for a bunch of the actual geospatial transforms, is leaky as anything (or rather, its Python bindings are), so pushing that work into child processes is a win, as letting the process exit is the only real way to clean up properly.

But at this point I think what I’m doing is no longer a good fit for Python, so I kind of envisage a Go backend, where I handle the concurrency side of things, and a small front-end language where I let the ecologists reason about geospatial files as opaque blobs, possibly with CSV as a native type. But I also think this is a good fit for a visual programming language, where you connect the CSVs and geospatial files with the kind of operators you’d normally use, and then have that be the ecologist’s view on the world.
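The chunking idea above can be sketched in plain numpy (a hypothetical helper, not the real library’s API): process a raster in row-chunks so only one chunk needs to be materialised at a time. In the real system each chunk would be handed to a worker process, which also means GDAL’s leaked memory gets reclaimed when the child exits.

```python
import numpy as np

# Hypothetical sketch: apply a function to a large raster one row-chunk
# at a time, so only a single chunk is resident in memory at once.
def apply_chunked(array, fn, chunk_rows=256):
    out = np.empty_like(array)
    for start in range(0, array.shape[0], chunk_rows):
        stop = min(start + chunk_rows, array.shape[0])
        # In the real library this call would be dispatched to a
        # child process rather than run inline.
        out[start:stop] = fn(array[start:stop])
    return out

elevation = np.random.rand(1000, 500) * 4000.0  # fake elevation data, in feet
mask = apply_chunked(elevation, lambda c: np.where(c > 3000.0, 1.0, 0.0))
```

Swapping the sequential loop for something like `concurrent.futures.ProcessPoolExecutor.map` over the chunks would give the multi-core version, at the cost of shipping chunk data between processes.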
I guess my main aim is to try not to do everything, though: if you imagine this was wildly successful, then all I’m doing is making yet another data-processing system specialised in one thing, and someone down the line will want to do what I’ve done for some other metric. So I think I want to deliberately keep this focussed/niche rather than accidentally drift into building something generic that will inevitably not be good for other purposes.
```python
import numpy

# Layer and UniformAreaLayer come from the library described above
elevation_layer = Layer.layer_from_file('elevation.tiff')
area_layer = UniformAreaLayer('area.tiff')
validity_layer = Layer.layer_from_file('validity.tiff')

# Work out the common subsection of all these and apply it to the layers
intersection = Layer.find_intersection(elevation_layer, area_layer, validity_layer)

# Work out the area where the data is valid and over 3000ft
# (1.0 where a pixel is over 3000ft, 0.0 otherwise)
def is_munro(data):
    return numpy.where(data > 3000.0, 1.0, 0.0)

result = validity_layer * area_layer * elevation_layer.apply(is_munro)

# result_gdal_dataset is the GDAL dataset backing the result layer
result_band = result_gdal_dataset.GetRasterBand(1)
```
I don't follow why the … is in the computation of … . Or is it perhaps not needed in the definition of … , just … ? Are the multiplications over matrices (so the ranks need to line up)? Just trying to make sure I understand the example. I don't know anything about GDAL.
Interesting stuff, I'll have to look at it in more detail (I am also in scientific computing). But a first reaction: isn't it surprising that ecologists have a hard time thinking in terms of small interacting programs? Isn't that very similar to an ecosystem in nature? In other words, shouldn't it be possible to present the situation in a way that they understand it?
@Kartik Agaram Oh, I wouldn’t read too much into the detail of the example; it was derived from a real example we worked on. The area_layer is a GeoTIFF that contains in each pixel the real area of the planet covered by that pixel. So if that last pair of lines was replaced with a sum over the result layer, the number you’d get would be the square meterage of planet covered by Munros (assuming the validity layer is a layer that just covers Scotland 🙂). For the actual work we used this TIFF in another set of calculations. In that work we’re reasoning about species’ area of habitat, so the result in that case is a GeoTIFF per species that has a 0 value where the species isn’t present, and the area of the pixel where it is.
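To make the arithmetic concrete, here’s a toy version of that calculation in plain numpy (made-up numbers, with numpy arrays standing in for the layers): the multiplications are elementwise, not matrix products, and summing the result gives the total area.

```python
import numpy as np

# Toy stand-ins for the three layers over a 2x2 grid of pixels
elevation = np.array([[2900.0, 3100.0],
                      [3500.0,  900.0]])   # feet
area      = np.array([[10.0, 10.0],
                      [12.0, 12.0]])       # m² covered by each pixel
validity  = np.array([[1.0, 1.0],
                      [1.0, 0.0]])         # 1.0 where the data is valid

# 1.0 where the pixel is over 3000ft, 0.0 otherwise
munro_mask = np.where(elevation > 3000.0, 1.0, 0.0)

# Elementwise multiply, then sum to get total Munro area in m²
total_area = (validity * area * munro_mask).sum()
print(total_area)  # 10 + 12 = 22.0
```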
@Konrad Hinsen I think the problem is less that ecologists have a hard time thinking about small programs (they’re quite happy scripting away) and more that the side effects of actions are not always apparent. This, to bring it around to the Future of Coding’s most recent episode, is why I don’t think you can “Do The Right Thing” as readily as people think: there’s some unexpected side effect going to bite you that you didn’t think of (unless you wrote your OS and application from the ground up in a provable language, I guess, but I’m not that person 🙂). It’s then less about capability, more about the time needed to become experts. Knowing all those side effects (if you run out of memory, Linux will vomit everywhere and die) is a computerist’s job to understand. The interaction of animal species and habitats is for the ecologists to understand; I don’t know how to do that, for similar reasons to why they don’t know about lazy evaluation and such. But it shouldn’t be that an ecologist needs me all the time. I want to build a system that Does The Right Thing by having me encode the grotty resource management inside it.
Doing "the right thing" is possible only if the abstractions you implement are actually implementable with the available resources and for the intended applications. Otherwise you end up with leaky abstractions, which don't help clients that much.
I read your post in the meantime... Some remarks on NumPy (I was part of the team that designed its predecessor, Numeric): the goal of Numeric (inherited by NumPy) was to provide 1) "APL embedded into Python" and 2) a Python interface to array data used in C extensions. APL, in turn, was originally designed as a new mathematical notation for use by humans, and only later implemented on computers. This implies the "right thing" approach to arrays as a high-level data model. Back in those days (the mid-1990s), scientific computing applications tended to be more CPU-limited than memory-limited, so we thought of an escape hatch for performance bottlenecks (C extensions, and later Cython), but not so much for memory bottlenecks. That was exactly what caused the fork called "numarray", which catered for large-memory use cases but turned out to be very inefficient for small arrays. NumPy re-united the two, at the cost of a more complex interface, but didn't extend "large memory" to "doesn't fit into memory" datasets. None of that is clear from the documentation, as so often. So people like you discover the unwritten assumptions at some cost. Sorry!
@Konrad Hinsen that’s super interesting, thanks for the insight! No need to apologise, I learned a lot along the way 🙂