Do generators really reduce the memory usage?

  1. The confusion
  2. The articles
  3. The manual page
  4. Takeaway loops
  5. What generators can't do

The confusion

Recently I noticed some controversy regarding memory usage with generators. It seems that many people genuinely see generators as a tool that can somehow get them "a big performance boost" when working with large datasets. A sort of magician's hat, where one can put any amount of data without affecting the PHP process's memory.

At first I wondered: how can that be? After all, we were working with large datasets for ages before generators. The best article on generators, the decade-old What Generators Can Do For You by Anthony Ferrara, hardly mentions any memory issues and explicitly states that generators "do not add any capability to the language". That was my understanding too: generators are a great invention with many interesting applications, but solving performance issues is not among them.

Hence my curiosity was aroused and I went to investigate.

The articles

It turned out that there are a lot of articles that falsely assign generators the leading role in reducing memory usage, or at least overemphasize their role in optimizing performance. But it's hard to write such an article without using one trick or another. For instance, a quite popular scheme goes like this:

First, state a real issue, such as

It's a common practice to work with large datasets. For example, to retrieve data from an API endpoint and store it into an array. Sometimes the dataset can increase in size so much that it can cause memory overflow.

Which is a fair problem. But then comes the trick: when it gets to generators, the article completely forgets that "array with a large dataset", substituting it with a generated sequence, and then proceeds to show off the benefits of generators when each value can be computed from the previous one. Which is simply impossible for the "array returned from an API"!

These articles just substitute a real data array with a sequence of numbers - 1, 2, 3... While the latter can indeed be produced by a generator, the former cannot. So the reader ends up with the wrong impression that generators can indeed somehow magically reduce memory usage when working with large datasets.

Speaking of real-life solutions for the "large data from an API" problem, it is not generators but more traditional tools that come into play: if some API is indeed so impolite as to unleash a really big amount of data on us, all this data, instead of being stored in an array, should be redirected into a local file (CURLOPT_FILE etc.), and then a streaming parser (such as salsify/json-streaming-parser) has to be used on that file to reduce the memory footprint. (Plot twist: such a parser could use generators internally, but that would be a different story.)
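To illustrate the first half of that approach, here is a minimal sketch of streaming a response straight to a file with CURLOPT_FILE. The function name, URL, and path are placeholders of my own, not part of any real client:

```php
// Minimal sketch: stream an HTTP response directly into a local file
// instead of buffering it in the PHP process's memory.
// download_to_file() is a hypothetical helper; $url and $path are placeholders.
function download_to_file(string $url, string $path): bool
{
    $fp = fopen($path, 'w');
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);         // write the body to the file as it arrives
    curl_setopt($ch, CURLOPT_FAILONERROR, true); // treat HTTP errors as failures
    $ok = curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    return $ok !== false;
}
```

A streaming JSON parser would then read that file chunk by chunk, so the full dataset never sits in memory at once.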

Or, getting even closer to real life, every sensible API offers a pagination option, so our code can make subsequent calls, reading a moderate amount of data each time.
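Incidentally, pagination is one place where a generator does pair naturally with an API. A sketch, assuming a hypothetical $fetchPage callable that stands in for a real API client and returns an empty array when pages run out:

```php
// Sketch: yield items page by page from a paginated source.
// $fetchPage is any callable that returns an array of items for a given
// page number (and an empty array when there are no more pages).
function paginated_items(callable $fetchPage): \Generator
{
    for ($page = 1; ; $page++) {
        $items = $fetchPage($page);
        if ($items === []) {
            return; // no more pages
        }
        yield from $items; // hand the items over one by one
    }
}

// Usage with a fake three-page "API":
$pages = [1 => ['a', 'b'], 2 => ['c'], 3 => []];
$all = [];
foreach (paginated_items(fn ($p) => $pages[$p] ?? []) as $item) {
    $all[] = $item;
}
// $all is now ['a', 'b', 'c']
```

Note that only one page of data is held in memory at a time, while the consuming code sees a single flat sequence.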

The manual page

Still I was wondering: where would people get the idea to write such articles? So I went to the manual page for generators and stumbled upon the following sentence:

"A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate".

It occurred to me that when people read it, they likely jump at the memory/processing-time part while overlooking the "code that uses foreach" part. Which, actually, is the key here. It's all about foreach().

From this standpoint, the sentence is absolutely correct: when our code uses foreach() to iterate over some set of data, then indeed we can use a generator to reduce memory usage. The thing is, we can always implement the same functionality without foreach(). So it is not the generator that should be praised for reducing memory usage, but the general principle of reading the data not as a whole but item by item. Though using foreach() can be extremely convenient. So,

The main feature of generators is not reduced memory consumption, but convenience.

Generators are great, but their power goes way beyond those silly tricks with memory consumption. Generators help you write much nicer and more flexible code. How?

Takeaway loops

If you look inside any memory-related magic trick with generators, you will see an old tired for() or while() loop panting inside. This is the real hero, the one actually responsible for all the "memory consumption" hype. We had been using such loops for decades, without generators and without any memory problems.

So it is not the memory problem that generators solve, but the portability one.

You may notice that in every example above, all the processing is inevitably done inside the loop. All the code is bound to be there. And that is not very convenient.

Imagine we have a file consisting of 1M lines with numbers that need to be used in some calculation - say, to sum up all even values. Of course, we can write a regular while loop:

$sum = 0;
$handle = fopen("file.txt", "r");
while (($num = fgets($handle)) !== false) {
    // here goes our calculation
    $num = trim($num);
    if ($num % 2 === 0) {
        $sum += $num;
    }
}
fclose($handle);

Fast, simple, low memory usage. No generators involved. The only problem is that this code is not too clean. It does more than is considered acceptable by clean code standards: both reading data from a file and doing calculations with that data. If, someday, another source of data is added - say, a CSV file - we will have to write another loop and duplicate the payload. Another source - another duplication. And so on.

Therefore, ideally, these two operations (reading the data in a loop and processing the data) should be separated. Before generators (or iterators), it was impossible to do so with the same memory footprint.

Say, we decided to move the calculation into a separate function (or we already had one) that accepts an array, which is then iterated over using foreach():

function sum_even_values($array) {
    $result = 0;
    foreach ($array as $item) {
        if ($item % 2 === 0) {
            $result += $item;
        }
    }
    return $result;
}

With the traditional approach, we are bound to waste a lot of memory:

$sum = sum_even_values(file("file.txt"));

as we are inevitably reading the entire file into an array.

And only here generators are to the rescue!

Using a generator, we can create a substitute for the file() function that doesn't read the entire file into memory, but instead returns it line by line:

function filerator($filename) {
    $handle = fopen($filename, "r");
    while (($line = fgets($handle)) !== false) {
        yield trim($line);
    }
    fclose($handle);
}

and now we can use this generator with our function, keeping the low memory footprint:

$sum = sum_even_values(filerator("file.txt"));

As a result, we've got much cleaner code, which is the real benefit of generators.

Essentially, we separated the loop from its payload. With traditional loops, we are bound to do all the processing inside. But generators allow us to detach the payload from the loop, making loops portable: they can wrap a loop in a parcel and deliver it elsewhere, where all the processing will be done. This is so amazing that I can't help feeling like I've been shown a rabbit coming out of a magician's hat ;)

What generators can't do

That's simple: they don't offer the full array functionality - namely, random access. So if you need it, a generator won't likely yield any useful result (pun intended).
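To see the limitation in action, here is a small sketch using only the standard Generator API (the numbers() function is just an illustration):

```php
// A trivial generator over three values.
function numbers(): \Generator
{
    foreach ([10, 20, 30] as $n) {
        yield $n;
    }
}

$gen = numbers();
// There is no $gen[2]: a Generator is a forward-only Iterator.
// To reach the third value, we have to consume the first two:
$gen->next();
$gen->next();
echo $gen->current(); // 30

// And a generator cannot be rewound once iteration has started:
// $gen->rewind(); // throws an Exception
```

Compare that with an array, where $array[2] is available at any time, in any order.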

In other words, generators aren't about arrays. All they are about is streams and sequences. Once we have a formula that can predict the next value, or a source that can return values one by one, and we don't want to read them all at once, yet want to process them using foreach - that's the exact use case for generators.
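The "formula that can predict the next value" case is exactly the classic xrange() example from the PHP manual - each value is computed on the fly, so no array of values ever exists:

```php
// A formula-based sequence: the next value is computed from the previous
// one, so there is never an array of all values in memory.
function xrange(int $start, int $end, int $step = 1): \Generator
{
    for ($i = $start; $i <= $end; $i += $step) {
        yield $i;
    }
}

// Iterating over a million numbers costs a few variables, not a
// million-element array:
$sum = 0;
foreach (xrange(1, 1000000) as $n) {
    $sum += $n;
}
echo $sum; // 500000500000
```

This works precisely because each value here can be derived from a formula - which, as discussed above, is not the case for an arbitrary dataset returned by an API.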

