the cups blog

07-23-08

Talk: Research Directions for Network Intrusion Recovery

Authors: Michael Locasto, Matthew Burnside, Darrell Bethea

Realized these areas affect more people than just the authors, and they would like feedback on these topics and research. Network intrusion is underappreciated on the recovery side; it's seen as boring system administration work.

Focus: usable systems for intrusion response. One benefit would be an incident archive. Orgs & people have a disincentive to share incidents (bad PR), so there's no archive for researchers to mine to see which problems to focus on. Network intrusion recovery is difficult territory for researching, designing & creating usable security mechanisms.

Started logging incidents: March 07, Dec 07, March 08. This talk covers Dec 07; the rest are in the paper.

The graphics research group in CS got 4 new machines with NVIDIA cards and unofficial drivers. IT staff installed the non-standard drivers and all was good. 12 months later, the machines crash on 12/6/07. Added to the ticketing system; IT staff rebooted them and everything seemed fine, then they crashed again on Monday 12/10. Finals week, so diagnosis didn't start until 12/13. There are two rdist masters, and they start to crash as well on Monday 12/10. There had been a recent kernel upgrade, so roll it back. Crash again on the 13th and 14th.

On the 17th, compile the kernel instead of just applying the binary. The make failed because it could not create directories whose names are just numbers. Sounds like a rootkit, possibly intercepting file IDs. This is the first time anyone thinks it might be a security issue. Booting from CD shows that common utilities were replaced. Every machine managed with rdist (200 machines!) has been compromised. And then the staff goes home for the holidays at the end of the week; plus, on Friday the 21st, half the staff is leaving for new jobs. Switched to a different Linux flavor, changed everyone's password, and sent text messages to everyone to go out-of-band.
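The tell here (standard utilities silently replaced on disk) is the classic target of file-integrity checking. A minimal sketch of that idea, assuming nothing from the talk beyond "compare binaries against a baseline recorded from trusted media"; the file names and simulated contents below are illustrative, not the actual incident's:

```python
# Hypothetical sketch (not from the talk): detect replaced utilities by
# comparing them to a checksum baseline recorded from trusted media.
# A real baseline would cover /bin, /sbin, /usr/bin on the managed machines.
import hashlib
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash one file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify(baseline: dict, root: Path) -> list:
    """Return names of files whose hash no longer matches the baseline."""
    return [name for name, digest in baseline.items()
            if sha256(root / name) != digest]

# Simulated run: record a baseline, then "rootkit" one utility.
tmp = Path(tempfile.mkdtemp())
(tmp / "ls").write_bytes(b"clean ls binary")
(tmp / "ps").write_bytes(b"clean ps binary")
baseline = {name: sha256(tmp / name) for name in ("ls", "ps")}

(tmp / "ps").write_bytes(b"trojaned ps binary")   # simulate the replacement
suspect = verify(baseline, tmp)
print(suspect)   # the replaced utility shows up
```

The catch, of course, is that the baseline and the checking tool must live on read-only media (the boot CD in this story); a rootkit that owns the running system can lie to any checker run from inside it.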

Lessons learned

  • There was no recovery agenda. Multiple conflicting points of view; masters students running much of the show, so no one's there long term. Decisions are informal and qualitative: why switch from RedHat? The swaying argument was that the person doing the install was "comfortable with the package management system." Why is that the right factor for a security-related decision? But the RedHat advocates had moved on from the group, so now people wanted to install what they knew. How do you create and update a plan in the face of so much churn? Reviewing once a year isn't going to be enough. How do we do this in a usable and efficient way?
  • Human memory is pretty bad. People involved in multiple incidents confused what happened when, and there wasn't clear record keeping.
  • IDS systems don't work. The rootkit conflicted with the unofficial video drivers and the machines crashed; in another incident, an NFS mount failed. Even when Snort is turned on, who's going to look at 500 messages a day?
  • The infrastructure is weak, and the human-level issues complicate things even further.

Tension about forensics: do you keep a machine up or take it down?

Staff and the ISP might want to take the machine down; you don't want a reputation for spreading a worm. But you might want to keep it up to figure out what's going wrong and be able to fix it. Users want to stop the threat to their privacy, but if it's a critical machine (or it's finals week) it may not be possible to take it down. 8 months later there are still machines vulnerable to the same attacks.

Research directions

  • Not just technical but human problems. First approach: a bulleted list of what could be possible, but that doesn't get across the interactions. Used Tufte as a starting point for visualizing a "decision surface" to help plan out activities and see where the complexities lie.
  • Predict latent vulnerabilities based on what you've already learned.
  • A recording infrastructure with "recovery trees"; figure out how to integrate it with current tasks. It needs to be a system woven into the infrastructure.
  • Technical comparisons of alternatives: NLP on release notes, querying bug databases, etc.
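As one reading of the "recovery trees" idea above, a sketch: record each action taken during an incident and its outcome as a tree of decision points, so the next responder can replay the reasoning instead of relying on memory. The `Step` class and its fields are hypothetical, not from the paper; the example nodes paraphrase the Dec 07 timeline:

```python
# Hypothetical "recovery tree": a tree of (action, outcome) decision points
# recorded during an incident. Structure and names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str                # what was tried ("reboot", "roll back kernel")
    outcome: str               # what happened ("crashed again on 12/13")
    children: List["Step"] = field(default_factory=list)

    def add(self, action: str, outcome: str) -> "Step":
        """Record a follow-up step taken after this one."""
        child = Step(action, outcome)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, step) pairs in the order steps were taken."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

# Reconstructing part of the Dec 07 incident as a tree:
root = Step("machines crash 12/6", "ticket filed")
reboot = root.add("reboot", "fine until 12/10")
rollback = reboot.add("roll back kernel", "crashed 12/13 and 12/14")
rollback.add("compile kernel from source", "make fails -> suspect rootkit")

for depth, step in root.walk():
    print("  " * depth + f"{step.action}: {step.outcome}")
```

Even this much, kept current, would answer the "what did we already try?" questions that the talk says human memory got wrong.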

Conclusion

The community should focus on creating mechanisms that deal with recovery as a system of both humans & computers.

Q&A. Q: Could recovery trees help with things like where the LDAP servers are and what happens if they fail? A: You need to know how things work now, and that's hard with 25 years of accumulation and no notes. You need a system that can figure out where things are.

Q: Interesting that when stuff breaks, we giggle when grandma says "my computer doesn't work, it must be a virus." How many times does stuff break that isn't security? A: Don't know; probably most breakage is not security. You dig when you find a symptom, maybe the network is slow.

Q: When building a DB of incidents, two problems: the kinds of problems people have may be so different they can't find anything useful, and organizations aren't highly willing to share. A: People do experience the same sorts of things, so there is value in comparing notes. There's also value for research, especially as people bring new tools into play and you can evaluate them. On the second part: you have to get friendly with sysadmins. They're willing to share; you have to talk to IT directly.
Q: Incident response varies by inside vs. outside threat. Any data on percentages? In your case, was it inside or out? A: We don't know; we suspect outside. We don't have data, and even defining what an insider is gets hard. See the Verizon report: http://securityblog.verizonbusiness.com/2008/06/10/2008-data-breach-investigations-report/

Q: was the driver the threat? A: no, just the canary that showed the rootkit.