Input/Output Data To Your ML Projects In Python
Long Term Storage!
An important consideration when creating training data is the format in which it is handed out, which should remain supported for as long as possible. In the Python ecosystem this usually means NumPy arrays, a format that can also be loaded easily from C++ [2]. Personally, however, I would not hand over a loose collection of NumPy arrays; I would always recommend distributing the data as an HDF5 file [3]. This format can be shared and used across many different languages, including Python [4].
Reading many files from a directory into Python?
I personally live and work entirely in the Linux world, so most of these tips come from that environment. One of them concerns the following code snippet, which loads all JPEG files in a folder into a list and then turns that list into a NumPy array.
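The snippet itself is not reproduced here, so the following is only a minimal sketch of what such a loader can look like; the folder path data/images and the use of Pillow are my own assumptions:

```python
from pathlib import Path

import numpy as np
from PIL import Image

# Collect all JPEG files in the folder and stack them into a single array.
# Assumes every image has the same height, width and channel count.
image_paths = sorted(Path("data/images").glob("*.jpg"))
images = [np.asarray(Image.open(path)) for path in image_paths]
data = np.stack(images)

print(data.shape, data.dtype)
```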
Although I have used a similar code snippet many times before, this time an error occurred:
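The exact message is not reproduced here; on Linux, hitting the open-file limit typically surfaces as something like `OSError: [Errno 24] Too many open files`.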
The folder contained more than 1300 files, which exceeded the system's default limit of 1024 open file descriptors at that point. Getting around this limit on a Unix system is relatively easy, as the snippet below shows:
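The original snippet is not shown; the limit can be raised in the shell with `ulimit -n`, or from within Python via the resource module. Here is a sketch of the latter (the exact limits are system-dependent, and whether this was the author's approach is an assumption):

```python
import resource

# Read the current soft/hard limits for open file descriptors and raise
# the soft limit to the hard limit (no root privileges required for that).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("before:", soft, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print("after:", resource.getrlimit(resource.RLIMIT_NOFILE))
```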
The memory footprint of the data we use is another important point. If you look at the snippet above for loading data into a NumPy array, you will notice that both the intermediate list and the final array live in RAM at the same time. I strongly recommend a functional style here: wrap the loading in a function that returns only the part you actually need. All intermediate objects then exist only inside the function's scope while the data is being built, and after the call only the returned result remains in memory. This matters when several people share a large remote machine!
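As a sketch of this functional style, reusing the hypothetical loader from above (the function name and folder path are made up for illustration):

```python
from pathlib import Path

import numpy as np
from PIL import Image


def load_images(folder: str) -> np.ndarray:
    """Load all JPEGs in `folder` and return them as one array.

    The intermediate list of images only exists inside this function;
    after the call, only the returned array remains in memory.
    """
    paths = sorted(Path(folder).glob("*.jpg"))
    images = [np.asarray(Image.open(path)) for path in paths]
    return np.stack(images)


data = load_images("data/images")
```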
Saving all data in HDF5 format
This step is not always easy, but it is enormously worthwhile. It is exactly the point where science can be accelerated. We currently also do far too little for datasets in the review process: there are more and more reports that papers introducing pure datasets are being rejected by major conferences, which is incomprehensible to me! Just look at the citation count of the MNIST handwritten digits [5] and you will quickly realize how important curated datasets are in AI/ML. The gradient-based learning methods of modern deep learning are currently easiest to study from an empirical point of view, and if we know the properties of a dataset exactly, we can draw better conclusions about the methods. But back to creating data in HDF5 form:
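A minimal sketch of writing such a dataset with h5py (the file name, the dataset names and the random stand-in arrays are assumptions):

```python
import h5py
import numpy as np

# Stand-ins for the images/labels produced by the loading step above.
images = np.random.rand(100, 64, 64, 3).astype(np.float32)
labels = np.random.randint(0, 10, size=100)

# Write both arrays as named datasets into a single HDF5 file;
# gzip compression keeps the file small at a modest CPU cost.
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("images", data=images, compression="gzip")
    f.create_dataset("labels", data=labels)
```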
Reading Data from HDF5 format
If you are handed an unknown HDF5 file, the first thing to do is check the names of the datasets it contains:
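With h5py this amounts to listing the keys of the file object (the file name dataset.h5 follows the sketch above):

```python
import h5py

# Print the names of the datasets (and groups) at the top level of the file.
with h5py.File("dataset.h5", "r") as f:
    print(list(f.keys()))
```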
The actual read, and therefore the actual memory allocation, then works like this:
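Again as a sketch with h5py, assuming the dataset names from above: opening the file is cheap, and memory is only allocated once a dataset is sliced into a NumPy array.

```python
import h5py

with h5py.File("dataset.h5", "r") as f:
    # Slicing with [:] loads the full dataset into RAM; partial slices
    # such as f["images"][:100] would read only that part from disk.
    images = f["images"][:]
    labels = f["labels"][:]

print(images.shape, labels.shape)
```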