Parsing log files using Python and RegEx

Log parsing with Python is a valuable skill in the security field, especially for analysts.
I want to show how it can be done using RegEx in Python and a few example log files. The steps I'll be taking are:

  • Reading content from the log files

  • Converting strings into lists

  • Updating the files

To get started, we first need a few log files.

As we can see, we have 3 files: error_message.csv, user_statistics.csv and syslog.log.
The files come in 2 formats, .csv and .log. Both are plain text, so Python can read, write or append to either of them.
For this example, we'll leave the current files alone and create new files with _parsed at the end of their names, so as to preserve the original log files.
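
If you'd rather derive the _parsed names than type them by hand, something like the sketch below could work. The helper name parsed_name is made up for this example and is not used in the script later in this post; it only builds a name and does not touch any file.

```python
from pathlib import Path


def parsed_name(filename: str) -> str:
    """Return filename with _parsed inserted before the extension."""
    path = Path(filename)
    # stem is the name without its extension, suffix is the extension itself
    return str(path.with_name(f"{path.stem}_parsed{path.suffix}"))


for name in ("error_message.csv", "user_statistics.csv", "syslog.log"):
    print(name, "->", parsed_name(name))
```

For syslog.log this gives syslog_parsed.log, which is the name used for the output file in this post.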

Let's open our text editor of choice and start writing our script, called parse_logs.py (or whichever name you prefer).
First, let's open our files in the program and read them:

# Create our file variables
error_message = "error_message.csv"
user_statistics = "user_statistics.csv"
syslog = "syslog.log"

# Read the files and store them in a variable
with open(error_message, "r") as error_log:
    error_file = error_log.read()

with open(user_statistics, "r") as user_log:
    user_file = user_log.read()

with open(syslog, "r") as sys_log:
    sys_file = sys_log.read()

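Important: the file names above are relative paths, so Python looks for them in the directory you run the program from. The simplest setup is to keep the program in the same directory as the log files and run it from there; otherwise, specify the full path to each file.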
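
If the program might be run from somewhere else, one option is to build the paths from the script's own location. This is only a sketch: the helper log_path is a made-up name, and it assumes the logs sit next to the script.

```python
from pathlib import Path


def log_path(filename: str, log_dir: Path) -> Path:
    """Join a log file name onto the directory that holds the logs."""
    return log_dir / filename


# Assumption: the logs live next to this script. Path.cwd() is a fallback
# for when __file__ is not defined (for example in an interactive session).
try:
    script_dir = Path(__file__).resolve().parent
except NameError:
    script_dir = Path.cwd()

syslog = log_path("syslog.log", script_dir)
print(syslog)
```

The resulting Path object can be passed to open() in place of the plain file name.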

Let's make sure that our files have been read properly. I'll add print(sys_file) to my code and run it:

Great! Now we can use RegEx to extract important information from our logs.
As we can see here, the usernames appear in parentheses at the end of each line. Let's try adding them to a list.

import re

# Create our file variables
error_message = "error_message.csv"
user_statistics = "user_statistics.csv"
syslog = "syslog.log"

# Read the files and store them in a variable
with open(error_message, "r") as error_log:
    error_file = error_log.read()

with open(user_statistics, "r") as user_log:
    user_file = user_log.read()

with open(syslog, "r") as sys_log:
    sys_file = sys_log.read()
    sys_usernames = re.findall(r"\((.*?)\)", sys_file)
    print(sys_usernames)

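Notice that we've imported Python's RegEx module with import re and stored our usernames in a new variable called sys_usernames. We used the findall() function from the re module to find every match of our pattern in the text. The RegEx pattern we used is \((.*?)\), so let's break it down:
\(: Matches the opening parenthesis '('. The backslash is needed because an unescaped parenthesis has a special meaning in RegEx.
(.*?): This is where the actual capturing happens. The unescaped parentheses ( and ) form a capture group, a way to extract a specific part of the matched text. When a pattern contains one capture group, findall() returns only the captured part, which is why the parentheses themselves don't show up in our list.
Next is the dot ., which matches any character except a newline character.
After that, there's the asterisk *, which matches the preceding element (in our case the dot) zero or more times, so together they match any sequence of characters.
Finally, the question mark ? after the asterisk makes the match lazy, so it stops at the first closing parenthesis instead of running on to the last one on the line.
\): Matches the closing parenthesis ')'.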
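
To see what the ? after the asterisk contributes, here is a small comparison on a made-up line (it is not taken from the log files used in this post) that has two parenthesised parts:

```python
import re

# Made-up line with two parenthesised parts, for illustration only
line = "ticket (1234) closed (alice)"

lazy = re.findall(r"\((.*?)\)", line)    # stops at the first ')'
greedy = re.findall(r"\((.*)\)", line)   # runs on to the last ')'

print(lazy)    # ['1234', 'alice']
print(greedy)  # ['1234) closed (alice']
```

It also shows that the pattern picks up anything in parentheses, not only usernames, so it's worth checking the results against your own logs.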

Let's run our program to see if it worked.

Success! Now let's try extracting the dates, which requires a new pattern.
For this part, we need to identify the format of the dates, which is 'Month Day HH:MM:SS', so our code would look something like this:

sys_dates = re.findall(r'\b[A-Za-z]{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\b', sys_file)

First, we store the dates in a new variable called sys_dates. Then we move on to our pattern, '\b[A-Za-z]{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\b'. Let's break this down:
\b - a word boundary: it matches the position where a word starts or ends, without consuming any characters.
[A-Za-z]{3} - three alphabetical characters, which correspond to abbreviated month names like "Jan" or "Feb".
\s - a single whitespace character, such as a space.
\d{1,2} - one or two consecutive digits, which correspond to the day of the month.
\d{2}:\d{2}:\d{2} - this part matches our HH:MM:SS pattern: two consecutive digits, three times over, separated by colons.
\b - another word boundary, which ensures the match doesn't end in the middle of a longer run of digits.
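
Here is the pattern at work on two made-up lines (not taken from the log files in this post). Some syslog formats pad single-digit days with an extra space, as in "Feb  1", and a single \s will not match that. Swapping in \s+ is one possible workaround, shown here as a variation rather than as part of the original script:

```python
import re

# Hypothetical lines; real syslog output may differ
sample = (
    "Jan 31 00:09:39 host app: INFO job started (alice)\n"
    "Feb  1 10:15:02 host app: ERROR job failed (bob)\n"
)

strict = r"\b[A-Za-z]{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\b"
# \s+ instead of \s tolerates the padding space before a single-digit day
padded = r"\b[A-Za-z]{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\b"

print(re.findall(strict, sample))  # ['Jan 31 00:09:39']
print(re.findall(padded, sample))  # ['Jan 31 00:09:39', 'Feb  1 10:15:02']
```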

Let's print this variable and run our program to see the results:

Fantastic, we now have a list of dates.
Now let's write these variables into a new file:

syslog_parsed = "syslog_parsed.log"
# Write to a new file - syslog_parsed.log
with open(syslog_parsed, "w") as sys_parsed:
    sys_parsed.write(str(sys_usernames) + '\n')
    sys_parsed.write(str(sys_dates) + '\n')

What we essentially did was create a new variable called syslog_parsed, which contains the string "syslog_parsed.log". This variable names our new file, so that we can preserve the original syslog.log file.
In our open() function we used "w" instead of "r" because we are writing to a file instead of reading it. There are four modes we can use:
"r" - Read - Opens a file for reading; if the file doesn't exist, it raises an error.
"a" - Append - Opens a file for appending and creates the file if it doesn't exist.
"w" - Write - Opens a file for writing and creates the file if it doesn't exist; if the file already exists, its contents are overwritten.
"x" - Create - Creates a new file; if the file already exists, it raises an error.
After that, we used the str() function to convert our lists (sys_usernames & sys_dates) to strings, because write() only accepts strings, and added a new line '\n' after each one.
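
To get a feel for the difference between the modes, here is a small sketch that works in a temporary directory, so it won't touch any real logs. The file name demo_parsed.log and the usernames are made up. It also writes one item per line with "\n".join(), an alternative to str() if you'd prefer the output without the list brackets.

```python
import tempfile
from pathlib import Path

usernames = ["alice", "bob"]  # stand-in for sys_usernames

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo_parsed.log"

    # "w" creates the file, or empties it first if it already exists
    with open(target, "w") as handle:
        handle.write("\n".join(usernames) + "\n")

    # "a" keeps the existing content and adds to the end
    with open(target, "a") as handle:
        handle.write("carol\n")

    # "x" refuses to overwrite: it raises FileExistsError here
    try:
        with open(target, "x"):
            pass
        created = True
    except FileExistsError:
        created = False

    contents = target.read_text()

print(contents)
print(created)
```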

Now let's try running our program.

After running our program, we have successfully created a new file. Let's cat it to see if it contains our lists:

Voila! We now have a parsed file containing all of our relevant information.
You can go further from here, for example by matching the dates with the usernames, but for this short demonstration we achieved what we were looking for: we read our files, turned the strings into lists and wrote them into new files.
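
As a starting point for matching dates with usernames, here is one possible sketch. It works line by line with a single combined pattern, so a line that is missing either piece is skipped instead of shifting the two lists out of step. The sample lines and the names line_pattern and pairs are made up for this example.

```python
import re

# Hypothetical lines; adjust the pattern to your own log format
sample = (
    "Jan 31 00:09:39 host app: INFO job started (alice)\n"
    "Jan 31 00:16:25 host app: ERROR job failed (bob)\n"
)

# Group 1: the date at the start of the line
# Group 2: the text inside the last pair of parentheses on the line
line_pattern = re.compile(
    r"^([A-Za-z]{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\b.*\(([^()]*)\)\s*$"
)

pairs = []
for line in sample.splitlines():
    match = line_pattern.search(line)
    if match:
        pairs.append((match.group(1), match.group(2)))

print(pairs)
```

Each entry in pairs is a (date, username) tuple taken from the same line.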
I hope this was useful for some of you. This was a great exercise that helped me practice my RegEx and Python and further hone my skills.
For anyone looking to further enhance their knowledge of RegEx, https://regexone.com/ is a great resource I used to learn and practice this tool. You also get to practice it in the Google Cybersecurity Certificate, if you're interested.

And of course for reference, here is the program in its entirety:

import re

# Create our file variables
error_message = "error_message.csv"
user_statistics = "user_statistics.csv"
syslog = "syslog.log"
syslog_parsed = "syslog_parsed.log"

# Read the files and store them in a variable
with open(error_message, "r") as error_log:
    error_file = error_log.read()

with open(user_statistics, "r") as user_log:
    user_file = user_log.read()

with open(syslog, "r") as sys_log:
    sys_file = sys_log.read()
    sys_usernames = re.findall(r"\((.*?)\)", sys_file)
    sys_dates = re.findall(r'\b[A-Za-z]{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\b', sys_file)

# Write to a new file - syslog_parsed.log
with open(syslog_parsed, "w") as sys_parsed:
    sys_parsed.write(str(sys_usernames) + '\n')
    sys_parsed.write(str(sys_dates) + '\n')

All log files were taken from this GitHub repository; full credit for these files goes to the author, Shubham Goel.