Archive for the ‘system administration’ tag
A Story of a Bug
It started, as it should, with the belief that it was my own bug.
I’ve been working for the last few days to generate daily summaries of the activity flowing through our tools. It’s nothing earth shattering, but it’s been a stepping stone to understand a little bit more about the Rails framework - and gave me the chance to begin experimenting with the Google Visualization API. Toward the latter part of the week, there was something a little odd with the “total valid” numbers with the daily account creations - I had made data changes to make sure I had some idea when accounts were vouched for and when they had been retired - so I naturally assumed it was something I did. I even went back and modified the model to make it more consitent with it’s peer models. And kept running the script that produced the daily stats in the Einstein-esque insanity of the doing the same thing twice and expecting different results. After about a dozen combinations of DATE(date_column) comparisons - I went to google, because I knew by then I was either going crazy or this was a legitimate “it’s not my problem” bug.
Which led me to this mysql bug. Reported July 19, 2007. Apparently introduced in MySQL 5.0.42 (May 2007) when the DATE comparison changed from strings to ints. Fixed within two days as part of 5.0.48 (Released August 2007).
But guess which mysql package Red Hat EL 5 (well, RHEL5 update 2) provides? - right, in between. It’s MySQL 5.0.45. And the forthcoming RHEL5 update 3 release doesn’t update MySQL either.
Development and System Administration is a weird, weird world. I use RHEL, not for support (I’m not even sure we have support with the University contract), but to have some degree of patch level stability that’s slightly longer than the fedora releases (and at the time I went to RHEL, Fedora was still dealing with it’s Red Hat transition) - but that stability comes with the price of things like this. I already use my own Ruby to get beyond the base install, but configure, make, make install for one piece of core software is a little different than dealing with it (or MySQL-supplied RPMs) for other software.
I’m glad the open source world gives me that choice, but open source + my labor + thousands of moving parts does give provide the reality that even when a bug is fixed two days later, in the open, patchable by everyone - that sometimes you can find yourself over a year later modifying your own DATE queries so that they don’t include nulls.
So that’s the overall summary of the post I guess - part of it to go into google that if you are getting odd MySQL DATE function results on MySQL 5.0.45 on Red Hat Enterprise Linux (RHEL 5) - it’s a bug. And it’s fixed. But not included in RHEL 5. And if you aren’t getting odd results with DATE comparisons - you probably don’t know that you are.
And maybe one part as a lament to that inevitable ongoing intersection of thousands of moving parts in every environment, not the least of which ours. And you trade off replacing mysql on multiple servers and just turn nulls into zeros (which then breaks your signup form that desparately needs an overhaul) - well, because it seemed to make sense at the time.
And people say you don’t use probability after college.
I bet the arms of those electrons are tired
I use a combination of monit process monitoring, simple icmp checks, and a lot of log checking to monitor most of our extension.org systems infrastructure. It isn’t fancy, and it’s not quite the visibility I want, but it’s good enough for where things are at right now. Most of proper monitoring isn’t about getting data about things as they are, but things that have changed from what they just were prior.
C is for cookie, and icmp’s good enough for me, or something like that.
Anyway, ping checks to the hosted server last night from the NC State monitoring box started failing last night - which is pretty much unheard of in normal operation. And in shelling into the box (a shell into NC State from Road Runner and then over to the hosted server) - showed the box was just fine itself, but the shell session was either dropping packets, or the latency was really, really bad because it was next to unusable. HTTP seemed ok. But I didn’t spend a lot of time browsing to see if it was affected all that much.
I broke out traceroute, not because it’s all that great of a troubleshooting tool, but it at least gives a relative glimpse if the echo times go up somewhere in the middle. From NC State to the hosted server - things looked, well, a little odd (some times left out to fit it in my content div for the blog)
traceroute to www.extension.org (206.252.150.82) 1 152.1.215.1 (152.1.215.1) 0.643 ms 2 vl2942-cmdfcore.ncstate.net (152.1.6.165) 0.512 ms 3 ncsugw2-gi2-1.ncstate.net (152.1.6.78) 0.592 ms 4 rtp7600-gw-to-ncsu-gw-2.ncren.net (128.109.70.45) 2.064 ms 5 rtp11-gw-to-rtp7600-gw-sec.ncren.net (128.109.70.121) 1.932 ms 6 ge-6-24.car2.Raleigh1.Level3.net (64.158.228.1) 3.291 ms 7 ae-6-6.ebr2.Washington1.Level3.net (4.69.132.178) 9.107 ms 8 ae-62-62.csw1.Washington1.Level3.net (4.69.134.146) 9.227 ms ae-72-72.csw2.Washington1.Level3.net (4.69.134.150) 12.561 ms ae-82-82.csw3.Washington1.Level3.net (4.69.134.154) 16.460 ms 9 ae-4-99.edge2.Washington3.Level3.net (4.68.17.204) 9.987 ms ae-2-79.edge2.Washington3.Level3.net (4.68.17.76) 9.948 ms ae-3-89.edge2.Washington3.Level3.net (4.68.17.140) 9.909 ms 10 * * * 11 so-4-0-0.mpr1.iad2.us.above.net (64.125.30.118) 95.598 ms 12 so-4-0-0.mpr2.iad1.us.above.net (64.125.29.133) 95.853 ms 13 so-2-0-0.mpr2.dca2.us.above.net (64.125.28.129) 99.993 ms 14 so-0-1-0.mpr2.lga5.us.above.net (64.125.26.106) 106.443 ms 15 209.133.17.94.available.above.net (209.133.17.94) 104.147 ms 16 above1-colo1-cr-4-g1-0-25.core.logicworks.net (206.252.139.214) 103.555 ms 17 extension-fw-pri-eth0.complex.logicworks.net (209.73.6.34) 103.962 ms 18 * * *
I’ve never actually seen multiple servers on a single line (lines #8 and #9) - but it’s not like I live in traceroute. Combined with line #10 - something looks a little weird at the edge between level3 and above.net. echo times go up 10 fold, and there’s some oddities there.
Things get really weird on the return trip:
traceroute to systems.extension.org (152.1.217.221) 1 * * * 2 209.73.6.33 (209.73.6.33) 0.460 ms 3 above1-er-1-g1-5.core.logicworks.net (206.252.139.209) 0.350 ms 4 above1-er-2-ge-1-ec1-2.core.logicworks.net (206.252.140.146) 0.445 ms 5 142.ge-1-3-8.mpr2.lga5.us.above.net (209.133.17.91) 0.425 ms 6 so-1-3-0.mpr1.lhr3.uk.above.net (64.125.31.181) 86.449 ms 7 pos-8-1.mpr2.cdg2.fr.above.net (64.125.23.26) 100.880 ms 8 level3.cdg2.fr.above.net (84.207.21.10) 104.975 ms * * 9 * ae-32-52.ebr2.Paris1.Level3.net (4.68.109.62) 104.307 ms * 10 ae-44.ebr2.Washington1.Level3.net (4.69.137.62) 107.609 ms ae-41.ebr2.Washington1.Level3.net (4.69.137.50) 114.159 ms 11 ae-6-6.car2.Raleigh1.Level3.net (4.69.132.177) 104.059 ms 12 ae-11-11.car1.Raleigh1.Level3.net (4.69.132.173) 105.263 ms 13 * MCNC.car1.Raleigh1.Level3.net (64.158.236.2) 114.115 ms 14 rtp7600-gw-to-rtp1-gw-sec.ncren.net (128.109.70.82) 235.562 ms 15 ncsu-gw-2-to-rtp7600-gw.ncren.net (128.109.70.46) 107.384 ms 16 itcore-x-ncsugw2.ncstate.net (152.1.6.249) 114.190 ms 17 vl2941-mmdfhub-6509-1.ncstate.net (152.1.6.146) 114.431 ms 18 exvm-217-221.ces.ncsu.edu (152.1.217.221) 115.972 ms
So if I’m reading those hostnames right - and they are an accurate reflection of geographic location - it sure ain’t Paris Texas. Traceroute got routed from New York to the UK to France, and back to D.C.
WEIRD.
Last night, I thought this was above.net’s problem, today I’m not so sure, I’m thinking it was Level3’s. I wonder if stuff like this actually gets reported anywhere.
Today, things look like you’d expect.