How to generate test files with Python

Edgar Montes
3 min readApr 26, 2021

The volume of data has been exponentially growing in the last few years and it will continue to grow in the upcoming years according to statistics from site statista.com. The data created, captured, copied, and consumed worldwide is estimated to reach 149 zettabytes in 2024. With this escalation in data growth, businesses across all industries have great opportunities to use this data to make data-driven decisions. The challenge, create efficient systems able to manage growing volumes of data. A key point of this challenge is being able to test these systems with real loads of data. Here is where the test files come into play.

Some challenges across many industries with test files are:

  • Non-availability at the right time for testing. It often happens when developing systems to manage data, the files from clients are not available until almost the time to go live. Such a thing puts so much pressure on teams to be able to adjust the systems to deliver and perform as expected.
  • When files are available on time, often only represent a small portion of a real load file. This produces a gap in real load size performance once the system starts production.
  • When improvements in the system are needed, the challenge also comes at the time to test the system with real-size loads of data.
  • In addition, regulations regarding protected information across industries restrict the data to specific environments.

Here is where the need to have real-size files to load and test the systems comes into play. In this blog, I propose a flexible solution coded in Python to generate test files to produce real loads of data to test systems. The code presented here can be escalated easily to produce several files, with several columns, with several records. All these with the benefit of not having to manage real protected information.

Here is the GitHub repository. This is a typical use case that applies for any type of industry where there is a file to store the following columns: DOB, address, first and last names.

  • First, let’s define the file name(of), the number of records we want to generate for the file(nor), and the delimiter(dlm).
  • Second, define the file layout in dictionary format key-value pairs. Place column names in the keys and in the values, place the functions to generate random strings.

Here is a brief description of the functions to generate random strings:

  • randomtimestamp(start_year=1920): This function generates a random date with a year between 1920 and the current year.
  • ran_str(string_length): This function generates a random string with length equals string_length. All characters are upper case.
  • ran_num(number_length): This function generates a random string of only numbers with length equals number_length.

I hope this code can provide you a solution to solve business problems. Feel free to connect and reach out for any questions on LinkedIn.

--

--