Let There Be Coast!

Note: This post is from Briana Harder, our newest Science Team member! We encountered Briana in Talk where she not only noted some issues, but then wrote code to reprocess images to fix them! Needless to say, we were impressed. What emerged was a wonderful dialogue between Briana, members of the science team, and the folk at Zooniverse. She’s made some large changes to our image processing pipeline and helped us all learn a lot about how to use Landsat for kelp in places *other* than California. As such, we asked Briana if she wanted to take her involvement to the next level, and join the Science Team. And we were delighted when she accepted! So, here are her comments on the awesome work she did and how our image processing has changed.

Hello, I’m Briana Harder, I’ve recently joined the science team to work on the processing pipeline that creates the images we classify. Some of you may know me from Talk as Quia. Back in November 2014, there was a blog post that ended with a shout out to anyone who wanted to collaborate on making the image selecting process better. I’m not a PhD in anything, but learning something new to solve a real life problem sounds like my idea of a fun weekend, and I have some experience with image processing. Turns out I’ve spent a wee bit more than a weekend on this, and I did make progress on the problem, so here I am to explain how my ‘fun project’ turns into improving the Floating Forests image pipeline!

The first thing to do upon finding an interesting problem is to find out if anyone else has solved it already. So I searched for research in the areas of image analysis and coastlines and satellite imagery. The majority of the papers were far too detail oriented to be very helpful, the problems in tracking the month to month changes of the coastline of a small island are wildly different from sorting coast from non-coast for FF! But I did find a fascinating paper on using Landsat data to build a highly accurate waterline database for all of Europe. They clearly solved the problem of finding ocean coastline, and then went a lot further!

The technique they used was to take a cloudless mosaic of the region–lots of preprocessing there!– and separate the image into three regions, water and land, selected with simple pixel value thresholds, and unassigned pixels. They then ran a region growing algorithm to add the unassigned pixels to either area.

This was good find for me, because they’re solving a very similar problem, and I know how to implement both those things! Unfortunately region growing is relatively slow and expensive, and it probably wouldn’t play nice with cloudy images. I did more digging over the next week, without finding anything else that was more promising. So I sat down, and wrote a little program.

Simplicity is important when you’re working with a lot of data; if the running time of the algorithm is longer than a person would take to do the same task, something has gone horribly wrong! I went through a couple iterations on how to find water, but in the end, this is what I ended up with.

Water is any pixel where the red value is between 1 and 25. Water’s very dark in all the bands, but it’s darkest in red, so that’s the best way to find it. If we’re clever about it, we only need to read the pixel values once, and perform some simple math operations, which means it should hardly take longer than opening up the image to view it.

– Count all the pixels that are water.
– Count all the pixels that are black, value 0. This ensures it’s not biased to throw out images that are on the edges of the Landsat scene.
– Calculate the percentage of non-black pixels that are water.
– If that percentage is above a certain threshold, we’re good to go, keep this image. I picked 5% as the threshold, based on a little trial and error.

And that’s it! It by no means gets rid of ALL the non-coast images, for example this does absolutely nothing for the abundance partially cloudy ocean images. It also gets tripped up by dark shadows on land, either from clouds or mountains, as shadows are just dark enough to fall within that threshold. Lakes are also selected, if they’re big enough.

The more complicated part comes after algorithms are made and tested: building them into the existing image processing pipeline. I wrote my algorithm in Python, making use of a few key libraries to do all the image processing; the pipeline is in Ruby, and uses a tool call ImageMagick for its image processing. I’m good at programming Python, I’d never touched Ruby until working on this project! And ImageMagick does seem quite ‘magical’ to someone who hasn’t used it before.

After reducing the problem of non-coast images, there’s the problem of the dark and red images that are especially common in the Tasmania dataset. The red part has been solved, but the darkness is still there for a lot of images. I have more work to do! But for now, we can say goodbye to a big chunk of the non-coast images in the next data set. No more bright blue snow-capped mountains, or solid fluffy cloud tops, or endless squares of farmland.

I’ll see you on Talk!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: