1/16 Okay, I have a technical question here. I administer a cluster
of java webservers running jetty (not resin, sorry). They are behind
a load balancer. They used to start up fine under "load" i.e. in
the load balancer configuration. The latest code release
broke this somehow and now they crash upon startup unless
I remove them from the load balancer first. The programmers
promised to fix this, but of course they did not and now
they claim this is "industry standard." It is burdensome
to me to have to remove them and then re-insert them every
time I need to do a restart. Does anyone else run a java
webserver cluster? Do you have to remove them from your
load balancer every time to start them up or restart them?
I already STFW and could not find anything on the topic. -ausman
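\_ For what it's worth, the classic cause of "crashes on startup
   unless it's out of the balancer" is the server accepting traffic
   before the app has finished initializing. A hedged sketch of the
   readiness-gate pattern in Python (illustrative only; the names
   here are made up, not Jetty or Netscaler API):

```python
# Sketch of a readiness gate: the health URL the load balancer polls
# returns 503 until startup work (caches, DB pools, etc.) completes,
# so the balancer keeps the node out of rotation on its own.
import threading
import time

class ReadinessGate:
    """Tracks whether the app has finished its startup work."""
    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        self._ready.set()

    def health_status(self):
        # The LB health monitor reads 200 as "in pool", 503 as "out".
        return 200 if self._ready.is_set() else 503

def startup(gate, init_tasks):
    # Run the expensive initialization first, then open the gate.
    for task in init_tasks:
        task()
    gate.mark_ready()

if __name__ == "__main__":
    gate = ReadinessGate()
    print(gate.health_status())          # 503 while still initializing
    startup(gate, [lambda: time.sleep(0.01)])
    print(gate.health_status())          # 200 once ready
```

   The point is that the balancer's health monitor, not a human,
   decides when the node rejoins the pool.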
\_ What I would do (I've been in similar situations):
1) Escalate to CTO so he knows #2 is coming:
2) "No, I can't bless this release because your P1 bug isn't
fixed. You'll have to explain to the CTO why this bug hasn't
been fixed." This will trigger a meeting where you can say:
3) "This is an industry standard? The code wasn't broken until
release X.Y.Z on date MMDDYY which you said was going to be
fixed. Find me the IEEE, IETF or other industry standards
body doc that says this is standard."
If you don't push back hard on this you're only setting yourself
up for even worse hassle down the road. I've 'worked' with code
monkeys like that before. They're classic bullies. Hit back hard
immediately and be rude about it unless you want to be their ops
bitch forever who has to kludge around their crap code.
\_ Unfortunately, there is a tradition of kludging around bad
code here that I am trying to change. Fortunately, we have
a new CTO who supports my general philosophy on this.
\_ Yes, and no. Since you didn't supply more information on how
you're accomplishing load balancing, I have no idea how to
fix your "problem".
\_ Netscaler.
\_ I assume it's something like a 9000 series. What's the
error you're getting when starting up jetty? Can you get
a debug trace out of it?
\_ Yeah, 9000. All kinds of errors, ultimately leading
to a server crash and restart, which then crashes and
tries to restart... I am trying to dig up the
exact error for you now. Actually, email me for details,
I don't want to post it on the motd.
\_ That proves it isn't "industry standard". Escalate and get the
programmers whipped into shape.
\_ They promised to fix it? Do you have that in writing or in a bug
tracking database?
\_ No, but they promised in the code release meeting, where I
have to sign off on code releases. I only agreed to let this
code go live on the condition that they would fix it later.
The CTO, who is in charge of both my group and programming,
was there. So I definitely can push back if I want to, but
I need evidence to make my case. It is probably true that it
is less overall time to do the laborious restart than it is
to fix the bug, at least in the short term.
\_ Unless it's an architectural problem that will compound as
they continue to build on the existing architecture.
\_ Sounds like an excellent opportunity to set up a script to handle
updates. Take node out of load balancing, restart it, test that it
started cleanly, and then put back into balancing. Been there,
done that.
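\_ Agreed that the loop is easy to script. A sketch of that restart
   loop in Python (the lb_remove/lb_insert/restart_node hooks are
   hypothetical stand-ins for whatever your balancer and init
   scripts actually expose, e.g. the Netscaler CLI):

```python
# Sketch of the restart loop described above: drain, restart, verify,
# reinsert, one node at a time. Hook functions are hypothetical.
import time

def wait_until_healthy(check, timeout=120, interval=5):
    """Poll a health check until it passes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

def rolling_restart(nodes, lb_remove, lb_insert, restart_node, check):
    """Restart nodes one at a time, never leaving a sick node in the pool."""
    failed = []
    for node in nodes:
        lb_remove(node)                  # drain it out of balancing first
        restart_node(node)
        if wait_until_healthy(lambda: check(node)):
            lb_insert(node)              # only reinsert once it starts cleanly
        else:
            failed.append(node)          # leave it out and flag for a human
    return failed
```

   The failed list gives you the nodes that never came back clean,
   which is also the evidence you want for the release meeting.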
\_ nonononononono!!! do *not* *ever* kludge up something on the ops
side because your coding team sucks. Make them fix their code.
If you want to tweak LB'd node status to maintain
a 100% consistent site, for example by taking out half, updating
them, putting them in a new pool, switching the VIP to that
pool, then doing the remainder: ok, I guess. You can be clever
for stuff like that if there's some need. But in this case, he's
dealing with lazy code monkeys who are trying to force an ops
policy change because they introduced a bug. They need to be
clubbed into submission. This will not be the end of ops policy
kludges to cover bad coding. He'll regret covering for them.
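\_ The two-pool cutover described above can be sketched like so
   (all function names hypothetical; a real Netscaler would be
   driven through its own CLI or API, not these stubs):

```python
# Sketch of the two-pool VIP cutover: update half the nodes, stage
# them in a fresh pool, flip the VIP, then update the drained rest.
# update_node / bind_to_pool / point_vip_at are hypothetical hooks.
def pool_swap_update(nodes, update_node, bind_to_pool, point_vip_at):
    half = len(nodes) // 2
    first, rest = nodes[:half], nodes[half:]
    for n in first:
        update_node(n)               # update the first half out of band
        bind_to_pool(n, "pool-b")    # stage them in a fresh pool
    point_vip_at("pool-b")           # flip the VIP; site stays 100% up
    for n in rest:
        update_node(n)               # now update the drained remainder
        bind_to_pool(n, "pool-b")
    return first, rest
```

   Note the VIP only moves after the first half is staged and
   verified, so there is never a mixed-version pool serving traffic.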
\_ bad coding happens. you can deal with it and catch the problem
before you have your entire production cluster spewing garbage,
or you can let the bad code mess everything up and get into a
finger pointing pissing match. OPS job is to keep stuff
working.
\_ OPS job is to protect the site. That includes making sure
crap code doesn't get pushed and pushing back hard on the
developers if it does. His situation isn't finger
pointing. The devs screwed the pooch and need to unscrew.
Slapping a condom on afterwards isn't going to fix
anything.
\_ What's wrong with doing both? I assume he doesn't
have to TELL anyone he wrote the script. -!pp
\_ The short version: doing it right is better.
Don't lie about IT stuff. Get busted once and your
already shaky credibility (you're in IT, right?)
is shot forever.
\_ I dunno, the exact details of how IT does its
job is not really usually that interesting to
engineering. I don't think you need to tell anyone
about all your little operations scripts, but
don't lie if asked either.
\_ what would you say if someone asked you how
long it would take to write a kludge script?
it's a very likely question. also, in some
places OPS will be working with engineering
and be more aware of how long the different
steps are taking even if they don't know the
details. i don't know if that's jim's case
but anyway, i wouldn't go out of my way to
be too helpful in a situation like this. the
new CTO was likely brought in *because* the
board or CEO or whoever understands the code
base is broken. C*O changes aren't common.
if so, then OPS can help the guy do his job
which will make OPS future much happier or
continue down the same path to piling more and
more madness on top. my philosophy is this:
don't do anything you'd say was garbage if
someone else did it and you were the new guy
taking over that job.
\_ What load balancer? HTTP keep-alives work decently well for us.
If it's not answering to the load balancer, it won't be in the
pool. |