Logs 101, or how I learned to stop worrying and love the log
Back in my early days in the industry, a grizzled veteran impressed upon me the importance of logs, and how they will save your backside time after time after time. These were the days when “have you turned it off and back on again?” was my default position.
We all started somewhere, right?
Anyhow - I’m a few years down the line now and I totally understand why that grizzled vet was so keen on logs. This blog post tries to explain why. Please bear with me.
So your mail server is bouncing mail. How do you know why it’s doing it?
So your website is responding slowly. How do you know why?
You’ve a user who is getting an “Access Denied” message. Why is that?
There are a number of tools that can help you troubleshoot: perfmon, resmon and the Sysinternals suite on Windows, or tools like top on Linux.
Generally, however, the best first place to look is the log files for the application or system in question. Some examples are Event Viewer on Windows, /var/log on Linux, or any number of flat files - refer to your application vendor of choice to determine where your problematic application is writing its log data.
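As a first pass at a question like “why is my mail server bouncing mail?”, you can simply filter the application’s log file for error lines. Here’s a minimal sketch in Python - the log path, format, and contents are fabricated for the demo, not any real vendor’s layout:

```python
from pathlib import Path
import tempfile

# A few fabricated log lines standing in for a real application log.
sample = """\
2024-05-01 09:14:02 INFO  mail queued for bob@example.com
2024-05-01 09:14:05 ERROR relay rejected: 550 spam score too high
2024-05-01 09:14:09 INFO  mail queued for carol@example.com
"""
log_path = Path(tempfile.mkdtemp()) / "app.log"
log_path.write_text(sample)

# Pull out only the ERROR lines - the quickest route to "why is it bouncing?"
error_lines = [line for line in log_path.read_text().splitlines()
               if "ERROR" in line]
for line in error_lines:
    print(line)
```

In real life you’d point this (or plain old grep) at the actual log file, but the principle is the same: the answer to “why?” is almost always already written down somewhere.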
What you absolutely shouldn’t do is invoke the traditional IT Helpdesk clause of turning it off and back on again. Any number of applications store their logs in memory (rather than on disk) by default, and a reboot flushes those logs, leaving you with no useful troubleshooting data. This is Bad, with a capital B. If anyone tells you to reboot as a first step in troubleshooting, report them to the nearest responsible person and never trust their judgement again.
As you grow in experience you may hear people mention Syslog. Syslog is amazing: instead of logging all of your data to RAM or local disk (which is almost as bad, for reasons that I’ll explain shortly), it ships all logs to a remote server. This disconnect between the server that’s running the workload and the server that’s holding the logs is important for a number of reasons:
If you suffer a security breach, the first thing any hacker worth their salt (and let’s face it: you don’t want to be getting breached by script kiddies!) will do is clear the logs. This might ring an alarm bell if you’re doing protective monitoring of your logs (and if you are: this post is teaching you to suck eggs and nothing more, but thanks for reading!), but unfortunately at that point all of your forensic data is likely gone. How do you know what the hacker has done? What data has or hasn’t been compromised? Which customers do you need to send an apologetic email to?
It means that you can run analytics on all of your log data. It’s much easier to spot a breach, or trace your performance issues, if you have all of your logs in one place. If you have 100 web servers and you’re seeing performance issues on those in a specific subnet, it’s all of a sudden easier to tie the issues back to a common cause - likely the network (because it’s always the network, right?)*
Have you ever met a Security Officer? They’re really keen on separation of privilege, and for a very good reason - the human is almost always the easiest aspect of a system to compromise. Separate your production server logs from your production servers and they will be Very Happy People. You can then run a system where your security team has access to all of your logs, and said logs can’t be compromised by the weak piece of wetware, intentionally or not.
*In this case it was the storage. Sorry about that, network admins.
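To make the “ship it off-box” idea concrete, here’s a minimal sketch using Python’s standard library SysLogHandler, which speaks the syslog protocol over UDP. In production the collector would be rsyslog or syslog-ng on another host; here a throwaway local UDP socket plays that role, and the logger name and message are made up for the demo:

```python
import logging
import logging.handlers
import socket

# Stand up a throwaway UDP listener to play the role of the remote
# syslog collector. In production this would be rsyslog/syslog-ng on
# a separate host, e.g. address=("logs.example.com", 514).
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))   # let the OS pick a free port
listener.settimeout(5)
host, port = listener.getsockname()

# Point a SysLogHandler at the "remote" collector over UDP.
logger = logging.getLogger("webapp")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.SysLogHandler(address=(host, port)))

logger.error("login failed for user alice from 10.0.0.5")

# The record now lives off-box: even if this process (or its whole
# server) dies, is rebooted, or is wiped by an attacker, the collector
# still has it.
datagram, _ = listener.recvfrom(1024)
print(datagram.decode())
```

The received datagram is the raw syslog message, with a numeric priority prefix encoding facility and severity. The point isn’t the wire format, though: it’s that the log line left the machine the instant it was written, so neither a reboot nor a log-clearing intruder can take it away.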