Regular expressions for internal auditors

Back to the past

I still remember the time when we had to audit large data sets by hand. You may think that is a long time ago, but I’m talking about the mid to late 1990s. That was in my youth as an auditor. I remember spending hours waiting next to a large printer that was crunching out reams of paper which I knew I had to tabulate and check the totals for. We did that by adding up those columns by hand, using calculators with really large buttons. Those buttons were there because it was so easy to mistype. And if that happened, you had to start all over again.

How representative were our tests? I wouldn’t feel too comfortable about them now. But that was all we had at that time. Large reams of paper, a calculator and time.

And on to the future

Data and data sets have evolved significantly over the past 15 to 20 years. Enterprise resource planning systems and highly complex accounting systems are commonplace. And it’s not only financial information, but also operational information that is increasingly captured in extensive data sets.

Internal auditors have evolved along with these systems. There is an entire field of auditing dedicated to nothing but auditing IT systems. Of course, as a knowledge manager, director, or chief audit executive, it becomes difficult to grasp the issues underlying the problems identified by these experts. Which is why I believe that, as internal auditors, we need to be ready to dive into this ocean of data. However, do this without adequate preparation and you are likely to drown. And we do not want that to happen, now do we?

Auditors need to get their hands on the raw, core data

If we really want to get hold of the information present in those large data sets, we need to get our hands on that data. There are a couple of ways to do it, but let me distinguish the two big ones: the right way and the wrong way.

The wrong way

Let’s start with the wrong way first. The internal auditor asks the ICT department or the responsible ICT experts in the accounting department to provide the data, cleaned out and prepared for review according to a specific format, on a server somewhere. There are a number of issues with this. First, if you ask a third party to clean data, you stand to lose a lot of highly relevant information. The information provided to you by these experts has likely gone through a number of manipulations. It is not the original data. It is some reflection of that original data. And if the data was pulled out of the system based on algorithms in the system, how will you ever know whether those algorithms did not influence or change the data? In short, there are too many steps between the actual transactional data as it is generated and the data presented to the auditor. You can write scope exceptions as much as you want; there’s a significant risk that a number of errors will not be identified.

So, then, what’s the alternative? Is there an alternative? You do not want us to go and take millions of individual records and analyze them, now do you?

The right way

Well, in a way I would like you to do exactly that. Or at least something rather close to that. You need to go to the core of the data. The core is the original transactional data that is generated the moment a transaction occurs in the system. That transaction can be anything: an accounting entry, a stock receipt, a personnel movement, and so on.

Okay, but wait, even for a smaller organization that means literally hundreds of thousands of transactions. I know. The reason you need to get as close as possible to the core data is that you want to avoid the issues that arise when that core data is manipulated in the system. Remember, we wanted to avoid that black-box feeling where all we have is the belief that our experts are expert enough.

So you need to get a data set of the raw transactions. The problem with these data sets is that they are quite often not very clean. It’s likely that you can get the data in a CSV or comparable format. If we’re talking about millions of transactions, Excel will not let you work with them. And even if you use a program such as IDEA, as we do, you need to make sure that you clean up your data set. Ideally, you select those elements you need within that data set and use that for your analysis.
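
To make that selection step concrete, here is a minimal sketch in Python. The file name and column names are hypothetical; in practice you would use whatever headers your own export contains. The point is that the raw export is streamed record by record, so even millions of transactions can be handled.

    import csv

    KEEP = ["doc_number", "posting_date", "account", "amount"]  # hypothetical columns we care about

    with open("transactions.csv", newline="", encoding="utf-8") as src, \
         open("transactions_clean.csv", "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=KEEP)
        writer.writeheader()
        for row in reader:
            # keep only the selected fields, trimming stray whitespace as we go
            writer.writerow({field: (row.get(field) or "").strip() for field in KEEP})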

An easy way to work with that raw data

And there is an easy way to do this. And that way is called regular expressions. “A regular expression is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. find and replace like operations,” as Wikipedia defines it. Most text editors support regular expressions and most CSV files can easily be imported into a text editor. Using regular expressions actually comes down to using the find and replace functionality of your text editor as a one- or two-line programming engine.
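
As an illustration of that one- or two-line engine idea, here is a minimal sketch in Python with a hypothetical file name; a text editor’s regex find-and-replace dialog does exactly the same thing. The pattern finds trailing spaces and tabs at the end of every line and replaces them with nothing.

    import re

    with open("transactions.csv", encoding="utf-8") as f:
        text = f.read()

    # find: one or more spaces or tabs at the end of a line; replace: nothing
    cleaned = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)

    with open("transactions_trimmed.csv", "w", encoding="utf-8") as f:
        f.write(cleaned)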

Wikipedia provides a number of simple examples. You can, for example, identify and select all instances where the name of a person is spelled in different ways, thereby ensuring that you have all transactions related to that person and not only the transactions linked to a certain spelling of his or her name.
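
A hedged illustration of that idea, again in Python: the name and its spelling variants below are made up, but the pattern shows how several alternative spellings collapse into a single search.

    import re

    # matches the made-up variants "Jon", "John" and "Johnny" in one pass
    pattern = re.compile(r"\bJoh?n(?:ny)?\b")

    with open("transactions.csv", encoding="utf-8") as f:
        hits = [line for line in f if pattern.search(line)]

    print(len(hits), "transactions mention this person, whatever the spelling")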

Another thing you can do with regular expressions is replace multiple spaces with commas, ensuring that import into other programs for further analysis is clean. Regular expressions can even be used to do extensive searches on certain occurrences within a data set without the need for a complex program other than a good text editor.
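
That space-to-comma replacement looks like this as a minimal Python sketch (the file names are hypothetical); in a text editor it is the same find pattern, two or more spaces, with a comma as the replacement.

    import re

    with open("report_export.txt", encoding="utf-8") as f:
        text = f.read()

    # find: a run of two or more spaces; replace: a single comma
    csv_ready = re.sub(r" {2,}", ",", text)

    with open("report_export.csv", "w", encoding="utf-8") as f:
        f.write(csv_ready)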

And best of all, regular expressions are not difficult. If you can state the problem in ordinary English and you know the basic concepts of the syntax, which are not difficult at all, the only limit on what you can do with that raw data set in your hands is your own creativity as an auditor.

The sky is the limit

Imagine that, just by performing a limited number of these tests, you can get additional assurance that the work done by the highly paid consultants who came in to help you analyze that new ERP system is actually worth what you pay them. That’s quite a strong use case. Combining regular expressions with tools such as IDEA or other comparable tools can liberate even a small internal audit department from its high degree of dependency on third-party suppliers for basic data analysis. That means the available budget can be used to ask your consultants the difficult questions. And that’s a good use of money.

An accessible book

I recently bumped into a short but highly accessible book on regular expressions. Do not be fooled by the name or by the fact that it’s still being developed. It is well written and it has helped me get my head around this amazing set of tools, for lack of a better word. The book is called “The Bastards Book of Regular Expressions” and was written by Dan Nguyen. In principle you can download it for as little as zero USD, but I would suggest that you consider the inherent worth of the book and pay the author for his work.

I know that regular expressions will really change the way we deal with data in our audit environment. I hope they will assist you too. Enjoy the learning!