Tuesday, December 16, 2014

Flickr, beards and feature recognition

So it looks like Flickr is applying image recognition beyond face recognition. A couple of months ago they demoed their "Park or Bird?" feature recognition system, but I hadn't realized they'd put anything into production until today.

Yesterday, I uploaded some old photos from mid 2000 that I'd taken with my first digital camera, and had recently recovered (thanks Dana). This morning I noticed that someone had found the photo below by searching for "Beard". That seemed odd, since I hadn't tagged any of the old photos. The only way that Flickr could know that Gac has a beard is by looking at the photo itself.



Something feels a little bit creepy about this, but on the other hand it's also pretty cool.

I would like to see Flickr expose this more explicitly, and give me an opportunity to edit these automatically added tags.

Sunday, November 16, 2014

Picasa



I learned something about Picasa today: when you edit a photo, it stores all the modifications as extra fields in the JPEG file and doesn't modify the displayed image. Until you export the image from Picasa, other tools only see the original image.
I have a bunch of photos from before 2006 that I salvaged from an old Thinkpad, and which I'd touched up with Picasa, and I want to upload them to Flickr. But because the touch-ups are all in metadata, none of the images reflect this.
I downloaded the latest version of Picasa for OS X, imported all of these photos, and amazingly it seems to have correctly applied the changes (which were made with a much older version of the tool). They're not pixel-for-pixel identical -- I presume that some of the enhancement algorithms have changed -- but they're damn close.

Friday, August 22, 2014

The case of the missing Citibikes

The example of my friend and colleague Ben, with his amazing I Quant NY blog, has motivated me to try my hand at some open data hacking. Ben's written several posts where he analyzes Citibike bike share data. Citibike has made all of their trip data through the end of May 2014 available for free download. I'm a huge fan of New York's bike share program and of their open data policy.

Ben has analyzed trips and stations, but I have a different question: how many of New York's shared bikes have been stolen or lost?

The New York Post reports that bikes are routinely stolen from Manhattan stations and ridden to underserved parts of Brooklyn and Queens. I'm not too concerned about these bikes: they're recovered quickly, and Citibike may wish to treat the bikes' eventual destinations as a kind of desire line. Clearly Crown Heights residents can't wait for the program to expand to their neighborhood.

In July, the 109th precinct proudly reported that their detectives had detected a 68 year old man riding a Citibike which had been liberated and repainted. The suspected thief was detained and his ride confiscated. That's what I'm looking for!

So I got myself the trip data, put together a quick-and-dirty python script, and identified the first and last trip for each bike in the system. I presume that if a bike is stolen or destroyed it will disappear from the trip data, so we can guess that if a bike hasn't been ridden in some time, it's likely gone AWOL.

Note that the bikeid field in the data doesn't appear to match the number stenciled on the bike's frame. It could correspond to the electronic identifier (probably an RFID tag) which the stations use to identify bikes. If that's the case, missing trip data could simply indicate that the electronics were damaged and replaced.

There are 6943 unique numbers in the trip data. This is roughly consistent with a New York Times story, published when the program launched, reporting 6,000 bikes in the system.

If we sort the bikes by their final trip, we can quickly get an estimate of losses.

MonthFinal ridesMonthFinal rides
2013-07182014-0126
2013-08362014-0260
2013-09172014-0368
2013-10312014-04428
2013-11412014-056186
2013-1232
The vast majority of the bikes showed activity in May 2014, meaning that they weren't stolen or lost. Before April, each month saw between 17 and 68 final rides, averaging 36.5 each month.

At first glance, April appears to have been a disastrous month for Citibike thefts. But a more likely explanation could be that those bikes have been removed for maintenance. If Citibike keeps 300-400 bikes in their warehouse for routine tuneups, and if it takes two months for the bikes to rotate back out into service, it could easily explain most of the 428 bikes which were ridden in April, but idle in May. We would expect most of them to return to service in June. 

February and March also saw higher than average losses. Perhaps bike thieves are more active in those months, but this may be better explained by the unusually snowy winter. Plows, for example, may have taken a toll on the fleet.

Citibike can, at least in theory, bill a rider $1200 for failing to return a bike. If they collected this fee for each bike which went missing before April, they'd have raised nearly $400,000. However I've yet to hear of any rider receiving such a bill.

Assuming that these final trips do represent theft and loss, approximately half of 1% of Citibikes are lost each month, or about 6% every year. That's far better than the reputed 80% of Paris's VĂ©lib' bikes which were stolen in that system's first year!

Update: the original version of the table showed 116 trips which ended in June. This is because there were a handful of trips which started on May 31 but finished after midnight, and were thus credited to the next month. To make it less confusing, I've merged these final rides into the data for May.

Monday, June 3, 2013

Variadic macros and trailing commas

One of the great new (ok, 14-year old) features of C99 is the variadic macro. This feature finally provides a portable way to write a logging macro, for example. Here's a simple example:

#define LOGIFY(format, ...) \
    fprintf(stderr, "LOG: " format "\n", \
    __VA_ARGS__)
...
LOGIFY("Found %d widgets.", widgetCount);
...

However, there's a subtle problem here. What if we call LOGIFY with no variadic arguments?

LOGIFY("Finished.");

This expands to:

fprintf(stderr, "LOG: Finished.\n",);

And that doesn't compile, because there's a trailing comma.

GCC provides an extension to the language which works around this problem. For that compiler, you can use token pasting to make the comma magically disappear:

#define LOGIFY(format, ...) \
    fprintf(stderr, "LOG: " format "\n", \
    ##__VA_ARGS__)

However this is non-portable.

StackOverflow suggests that you can make the format part of the variadic part of the macro:

#define LOGIFY(...) \
    fprintf(stderr, "LOG: " __VA_ARGS__)

However this doesn't work for my example because I want to be able to past a new line onto the end of the format string. Since the format string isn't isolated from the arguments I can't append anything to it.

An ugly work-around is to simply pass in an extra, unused argument if you have no real arguments:

LOGIFY("Finished.", 0);

It's always harmless to pass extra arguments to a variadic function, and this does compile, but it doesn't seem very elegant.

If the ANSI C committee ever revisits this, or if I design my own language, they (or I) could resolve this problem by permitting trailing commas in function calls (and function declarations). C already permits trailing commas in array initializers (as do Java and other C-like languages):

int fibs[] = {1,1,2,3,5,8,13,};

This is convenient for a number of reasons, including conditional inclusion (#ifdef) and source code generation. It can also be a good habit to always include a trailing comma, especially when declaring arrays of strings. Without a trailing comma, there's a risk that a developer might add a new string to the end of the list without a comma. Since CPP does string concatenation, the results may be unexpected!

const char * messages[] = {
    "No error",           // 0; added in 2002
    "First legacy error", // added in 2002
    "Other legacy error"  // added in 2002
    "Brand new error"     // new in 2013
};

The same problem can occur with numbers if sign tokens (+/-) are used.

Why not support trailing commas in functions, too? It would be more consistent, and there are a number of not-immediately-obvious advantages. In addition to this variadic macro case, consider the case where one or more of the arguments is conditionally included:

void
doWork(
    int howMuchWork,
    WorkType whatKindOfWork,
#ifdef THREAD_SAFE
    bool useLocks
#endif
)
...

This doesn't compile if THREAD_SAFE isn't defined because there's an unused comma after the second argument. If C permitted trailing commas here we'd avoid awkward constructs like moving the comma into the conditional code (and even that wouldn't work if each of the arguments were conditional!).

Although it's not consistent with the normal usage of commas in prose the trailing comma is surprisingly useful in programming languages, and ought to be supported more widely.

Monday, March 25, 2013

Square Root: My solution

Wouter Coekaerts's square root puzzle presents a difficult challenge. We need to figure out the number he's thinking of without seeing the number. The only question we can ask is: "is it this number?" (In fact, it's a bit worse than that: when we ask that question, he just writes the answer on the console if we're right.) There are a few possible toeholds here where we could get started. I'll discuss some I rejected before I get to my actual solution.

Wouter gives us the source code for the answer() method, so we know how he's testing the answer. We also know how he generated the answer: a BigInteger of approximately 10000 random bits generated from a SecureRandom.

The easiest approach would be to just look at the answer. However the answer (actually the square of the answer) is stored in a private field. We can't get at it without using reflection, JNI, sun.misc.Unsafe, or something else of that nature. The challenge doesn't permit those types of attacks. What about accessing it as an inner class? Inner classes can access private fields of their enclosing classes, so could we trick the JVM into thinking our code is an inner class of square.Square? You can't do this, because inner classes are just a fiction created by the compiler. It might look like you can access private fields of your enclosing class, but the compiler actually adds hidden accessors for those fields which it calls under the covers. This approach won't work.

What if we could influence the answer? Can we control what SecureRandom does? We can do this by providing a different source of entropy from the command line or by modifying /dev/random. However I'm not sure that these can be done with a security manager in place or without root access, and I think that they violate the "no security vulnerabilities" spirit of the challenge.

One promising approach relies on the lack of a final modifier on BigInteger. What if we don't pass in a BigInteger for the answer, but pass in our own subclass of BigInteger? For example, could we override BigInteger.equals() to always return true? This looked like a promising line of attack, and I think it might actually work in some implementations. Unfortunately, the way Wouter wrote answer() he doesn't directly invoke any methods on the candidate object we control and OpenJDK's implementation of BigInteger.divide() doesn't call any methods on the divisor object either, so there are no opportunities there (however GNU Classpath's Biginteger.divide() does call methods on the divisor). BigInteger probably should have been final, but unfortunately this oversight doesn't seem to be sufficient to solve this puzzle. (If answer() had been implemented as root.equals(n.divide(root)) instead of n.divide(root).equals(root)this solution would have worked well, because our own equals() would be invoked.)

If we can't subclass BigInteger, can we just replace it? We can write our own implementation of BigInteger which always returns true for equals, but we'd need to replace the built-in one. This would require modifying the bootstrap classpath with -Xbootclasspath/p:, which I think violates the rules, or at least the spirit, of the challenge.

What about a simple brute force approach? I was actually tempted to submit this. We know the upper bound of the answer (210000), so a simple for loop ought to suffice. Assuming we can test one answer per nanosecond (highly optimistic!), wolframalpha estimates that it would take 6.322×102996 years, or 4.6×102986 times the age of the universe. The solution is provably correct, if somewhat impractical.

When I was looking at the implementation of BigInteger.divide() to evaluate the feasibility of the subclassing attack, I noticed that there are a few special cases at the top of the method. Consider a/b. If a = b, the result of integral division will always be 1. If a < b, the result of integral division is always 0. If a > b, the code is much more complicated. This suggested that a timing attack might be feasible: we should be able to determine if our candidate is less than Wouter's recorded number by how long it takes for the answer() function to complete.

Note that this attack is dependent on a number of things. We know how Wouter is testing the answer: n.divide(root).equals(root). If he'd done the more efficient n.equals(root.times(root)), instead, this attack wouldn't work (or would be much more difficult). It also relies on the JDK not doing unnecessary work for division when the dividend is less than the divisor. In Java 6 the JDK still calculates the remainder for this case (and discards it) making the attack more difficult but still feasible. In Java 7 the timing difference is more pronounced.

Timing the function is also quite challenging. For one thing, JVMs use JIT compilers which may compile code multiple times, with increasing levels of optimization, as they discover what's important. Therefore it's important to give the JIT plenty of opportunity to compile the code before we start trying to time it. I did most of my initial testing using -Xint, which disables the JIT, to eliminate this variable until I was confident that the solution worked. Garbage collection can also interfere with the timing, but it can only increase the time. If we only look at minimum times, and do enough measurements, it shouldn't matter. Other programs running on the machine can interfere with our timing, as can power management features. I tried running the solution on my MacBook Pro at first, but the timing was all over the place. In the end I used an isolated Linux x86-64 Xeon X5690 (Westmere) machine and Hotspot Java 1.7.0_09_b05 to test the solution. It was much more reliable.

Once the JIT is warmed up so we can get reliable timing, the next step is to calculate a base line. I do this by generating my own, random, 20000 bit number (the square of a 10000 bit number is a 20000 bit number) and testing how long it takes to divide that by number slightly lower than it. I repeat this a million times and record the minimum time it took to run my own answer() function, which is identical to Wouter's. Now that we have the baseline for a 'slow' divide.

Someone with a better statistical background that me could probably come up with a robust way of distinguishing between 'fast' and 'slow' division. I just used a heuristic and ran the test many times. If the divide time is >= 75% of the fastest baseline time, I consider it slow. If it's <= 50% of the fastest baseline time, I consider it fast. If it's somewhere in between I run the measurement again until I get a conclusive result. I short circuit for fast times, since I assume that these must be fast due to the number being tested, but not for slow times, since they could be slow for any number of the reasons discussed above.

Then I start to identify bits. I start with the highest possible bit and work my way down. Whenever I discover that the divide is fast with a particular bit set, and slow with the same bit clear, I know that we've identified the next bit in n. When the first stage of my program finishes (it takes about 80 minutes), it should have identified the highest number for which answer() is 'slow'. In other words, we know n-1.

Obviously determining n from n-1 is trivial, but Wouter doesn't actually want us to find n: he wants the square root of n. Unfortunately, there's no built-in library function to do this. I could have implemented Newton's method to solve the square root, but fortunately someone else already did this for me:  BigIntegerMath in Google's Guava has a very efficient implementation that handles a lot of corner cases I probably wouldn't have bothered with. Wouter wanted the solution in a single class, and I didn't want to rely on libraries not included with the base JDK, so I copied the relevant code from Guava. Since this is an Apache licensed project, that's conveniently legal.

My program runs in about 80 minutes on Java 7 on an Intel Xeon Linux server. The actual time depends on the ratio of 0 bits to 1 bits in n. 0 bits can be identified much faster than 1 bits. It may or may not work on other machines without tweaking some of the timing parameters. There's no way to tell for sure from within the program if it got the right answer, since Wouter just prints something to the console, but I do check to see if it's feasible. Most integers have irrational square roots, but we know that n is a perfect square. Therefore I test that the number I identified is also a perfect square. If it's not, I repeat the whole process again. A more sophisticated solution might track a confidence level for each bit and try remeasuring those, first, but it's always found the solution on the first attempt.

You can find the source code at github.

Thanks to my colleagues at Two Sigma who provided useful insights into this problem. Andrew Berman and Yaron Gvili suggested attacking SecureRandom, either directly or via /dev/random and /dev/urandom. Yaron also discussed the brute force approach. Isaac Dooley steered me away from mean and towards min and also explored ClassLoader based solutions. Trammell Hudson encouraged me to continue with the timing attack even when initial results were frustratingly ambiguous. Another colleague who shall remain anonymous suggested I prove that P=NP, and then implement a polynomial-time solution.

Wednesday, March 20, 2013

Java Puzzle: Square Root

I'm working on a solution to Wouter Coekaerts's square root puzzle. It's a challenging problem and will test your knowledge of Java and JVMs. I'll post my solution next week, but here's a hint about the direction I'm taking: Wouter's answer() method isn't the most efficient way to test for the correct solution.

Tuesday, October 30, 2012

Kobo surgery

A few weeks ago my Kobo Wifi e-reader suffered an unfortunate accident. I got caught in a thunderstorm and the pocket of my messenger bag filled with water. (Good: my Brooks Barbican bag is waterproof. Bad: water can't get out once it's in.) It was a few hours before I noticed that my Kobo had been partly submerged for an extended period. The poor thing just displayed a plaintive "Please Charge Your eReader" message. Pressing the power button caused some lights to flash, but no activity on the screen. It wasn't completely dead, but it wasn't working either.

Coincidentally, the next day my friend Roo tweeted that he'd dropped his Kobo and broken the screen. I suspected that between the two of us we probably had enough parts to build a working Kobo, so I arranged to collect his e-reader so I could try to rebuild one.
Dead board (left) dead screen (right)
The first step was opening the cases. The Kobo cases just snap together, so you can pry them apart fairly easily. Be careful as you risk cracking the case if you bend it too much. I damaged the white case a bit but was able to get the black one off without any problem. The board is attached to the rear half of the case with four Phillips screws which are easily removed.
Naked Kobos

The Kobo is fairly simple inside. The e-ink screen is mounted on the circuit board and connected to it with a flexible flat cable which wraps over the right-hand side of the board to a plug on the reverse. There's a lithium-polymer battery in the lower left hand corner, and you can see the 5 buttons of the rubber navigation pad in the lower right corner.
Removing the screen.
It took me a while to figure out how the screen was attached. I was worried that it might be glued to to board. Fortunately, it's quite easy to remove once you know that it's only attached with four strips of double sided tape.

First, unplug the flexible connector, but be careful! The plug has a plastic clamp to hold the connector in place. These break easily and without the clamp you'll get a poor connection. You need to disconnect this so that you can have unobstructed access to the edge of the screen.

I found that a thin utility blade slipped easily between the screen and the board. Just run this around all four edges and you'll loosen the tape. The tape will re-adhere pretty quickly, but it's quite easy to pry the screen off with your fingers once you've cut through the tape.
Screen detached. You can see where the four pieces of tape were.
Once both screens were off it was a simple matter to swap them. The boards also have Micro-SD cards which are used to store your books. (This is in addition to the SD expansion slot at the top of the card. This means that if you want to expand your Kobo's capacity you could probably easily replace the 2GB Micro-SD card with a larger one.) I swapped the Micro-SD cards, too.
It's alive!
While it was open, my friend Trammell pointed out that there were some connectors near the navigation pad which looked suspiciously like a serial port. (They were marked Tx and Rx, which was a bit of a give-away!) We connected it to a terminal emulator and were able to watch the Linux kernel boot as the Kobo powered on. It seemed a shame to hide that in the case, so before I put the case back on Trammell helped me make a few small modifications. We added a small hole to the case and soldered a connector onto the serial port.
Serial connector. From top to bottom: V, Tx, Rx
I haven't played much with the serial port yet, but I expect it might expose some interesting opportunities.

After swapping the Micro-SD cards the new Kobo showed all of my books in my library, but would only let me read some of them. This suggests that the DRM scheme is tied to some serial number on the device, and also that not all books are protected by DRM. I did a factory reset on the Kobo, connected it to my laptop and resynced all of my books. This made all the books readable again.