Scala streams and Python generators

I am starting to learn Scala, coming from Python. This post is about comparing both languages regarding what we call generators in Python, and how I happily overcame an initial frustration with Scala.

Python generators

One of the greatest things about Python is that you can easily work with generators, which are objects that you can iterate over them, but their contents are not necessarily in memory. They might be calculated “on-the-fly”, for instance, or they may be file chunks or data coming from a database in an iterative fashion, and not all of it at once.

A generator can have an internal state, and you usually produce values inside of a loop. Generator functions are a simple way to create generators in Python, they are functions that use the yield keyword. Here is one example generator function to calculate the Fibonacci sequence.

# File: fibogen.py
from itertools import islice

def fiboFrom(a, b):
    yield a
    while True:
        a, b = b, a + b
        yield a

print [n for n in islice(fiboFrom(1, 1), 0, 19)]

And the output:

> python fibogen.py
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181]

Notice islice was necessary to pick just a piece from the sequence. The generator itself can be accessed indefinitely.

This code uses a list comprehension in the end, another great Python feature, that is closely related to generator expressions, which are just like list comprehensions, except they are written with parenthesis instead of square brackets, and they create a generator instead of a list. Therefore they are a way to make “lazy” transformations to another list of values, for instance. List comprehensions and generators are they Python way to perform list filtering and mapping. And by using generators you can be sure you are always sparing memory, and only performing calculations when they are strictly necessary, so that tends to make your code safer and faster. Compare it to filling you memory with large intermediary lists of values that are actually intermediary calculations and used only once, or never used at all sometimes.

(Fun story: while I was writing the code I accidentally ran it without islice. My machine hanged and Python was eventually killed. That was the list comprehension filling the memory. That’s why lazy generators are “safer”…)

Scala for-yield

When I started learning Scala, just the other week, I saw the `for-yield` expression and immediately associated it with generator expressions from Python. I thought I could just write my code using these expressions like they were Python generators. So I did a script to read a CSV file with data from a bunch of people, parsing each line, etc, and then calculating some aggregate statistics from this file.

I did the code, and it worked for a small 10 thousand rows input. It seemed fast and nice. Then I tried to run the code on the large file, with 3.6GB and 5.7 million rows, and to my surprise… I had a memory exhaustion failure!

I was shocked, and I figured out immediately that the for-yield expressions were not working as Python generators, and instead I was reading the whole file into memory first, to start transforming and calculating the aggregate statistics in a separate step, when what I wanted was to read the file line-by-line and update the statistics iteratively. Because I am not insane.

It took me a while looking into Stackoverflow and other fora, and browsing through books, and for a moment I even believed it was possible that having the generator behavior might be something “unnatural” in Scala, even though possible. I saw many people showing how to create classes in order to emulate the generator behavior… I was really pissed off, I must say. And I was surprised too that the Scala literature doesn’t seem to care much about this. It is never too clear when you are potentially creating a huge list in memory, and when you are working with a memory-bounded generator. This is important, because it may be the difference between you being able to work with a large file or not, as it happened to me in practice.

So, anyway, eventually I stumbled upon the solution… I still don’t know much about Scala to give a full explanation, but the thing is Scala has an iterator type called Stream, and that is what you should be iterating over if you want the lazy behavior from a Python generator. Otherwise, you will probably end up with a List or something else and blow up your memory.

In my program I was reading the file using scala.io.Source, and eventually it was turning into a list. All I needed was to make sure it was a Stream I was iterating over, and the problem was fixed. I mean, I just implemented this solution and I believe it is fixed. I am still waiting for the program to run reading that big file, but memory wasn’t exhausted yet!

The next section has an example illustrating the generator behavior in Scala and in Python.

Python generator behavior in Scala

This following Python code doesn’t make much sense, but it is a program with two nested generators. The first generator is just a range from 1 to 3. The second one has a two-level for, with the first over the first generator, and the second with three values starting from the successor of the variable from the first loop. Then we have a loop over the second generator, print the values coming out. Inside each of the generators we have print statements, that allow us to see what is happening in the program flow.

# File: generator_demo_good.py

def genA():
    for a in range(1, 4):
        print "a" + str(a),
        yield a

def genB():
    for a in genA():
        for b in range(a + 3, a + 6):
            print "b" + str(a) + str(b),
            yield b

for b in genB():
    print "c" + str(b),

And the output:

> python generator_demo_good.py
a1 b14 c4 b15 c5 b16 c6 a2 b25 c5 b26 c6 b27 c7 a3 b36 c6 b37 c7 b38 c8

The sequence seen in the final loop over the b variable (in the “c” loop sorry if that is confusing) should be “4, 5, 6, 5, 6, 7, 6, 7, 8”. These are the numbers we see in the output associated with ‘c’, but because we are using generators we see all the intermediary messages coming from the other two loops mixed together with the ‘c’ messages. That happened because we are stepping through those loops lazily, only at each iteration of c is that we do another iteration in the ‘b’ loop from the second generator, and then eventually a new iteration in the ‘a’ loop from the first generator.

Now let’s see what happens if we try to write that code in Scala in the “straight away” fashion, replacing the python generator functions with if-yield expressions with the same parameters.

// File: generator_demo_fail.scala
object TestGen extends App {

  def genA = for (a <- (1 to 3)) yield {
    print("a"+ a + " ")
    a
  }

  def genB = for {
    a <- genA
    b <- (a + 3 to a + 5)
  } yield {
    print("b" + a + b + " ")
    b
  }

  for (c <- genB) print("c" + c + " ")
}

TestGen.main(args)

And the output is:

> scala generator_demo_fail.scala
a1 a2 a3 b14 b15 b16 b25 b26 b27 b36 b37 b38 c4 c5 c6 c5 c6 c7 c6 c7 c8

See what happened? Scala is first iterating all over the ‘a’ loop from the first generator, which is actually not really what I am used to call a “generator” because it didn’t produce its values lazily, and then it is iterating over the ‘b’ loop from the second generator, and then the ‘c’ loop. All the iterables are being swept at the beginning of the loops, and only after that we enter the body of the loop and actually start processing stuff. That means a whole file would be read in memory before we even start processing the first row, for instance.

Now the fix for this behavior in Scala is actually pretty simple, you just need to append a “.toStream” to transform the Range iterables into Streams. Now the if-yield expression will actually work as a Stream, or a “Pythonic generator”, just like we wanted. So the new code is simply:

// File: generator_demo_good.scala
object TestGen extends App {

  def genA = for {
    a <- (1 to 3).toStream
  } yield {
    print("a" + a + " ")
    a
  }

  def genB = for {
    a <- genA
    b <- (a + 3 to a + 5).toStream
  } yield {
    print("b" + a + b + " ")
    b
  }

  for (c <- genB) print("c" + c + " ")
}

TestGen.main(args)

And the output:

> scala generator_demo_good.scala
a1 b14 c4 b15 c5 b16 c6 a2 b25 c5 b26 c6 b27 c7 a3 b36 c6 b37 c7 b38 c8

As you can see, we perfectly reproduced the Python behavior.

Now for completeness we might ask, how do you spoil the Python code to reproduce the first, “normal” Scala behavior? Here is how you might do it, just convert the generators into lists at the beginning of each loop:

def genA():
    for a in range(1, 4):
        print "a" + str(a),
        yield a

def genB():
    for a in list(genA()):
        for b in range(a + 3, a + 6):
            print "b" + str(a) + str(b),
            yield b

for b in list(genB()):
    print "c" + str(b),

And the output:

> python generator_demo_fail.py
a1 a2 a3 b14 b15 b16 b25 b26 b27 b36 b37 b38 c4 c5 c6 c5 c6 c7 c6 c7 c8

Fibo again

Since Scala can do lazy generators, how would the first Fibonacci sequence example look in Scala? I actually wrote that first Python example as a counterpart to a lazy Fibonacci example in a book, which looks like this:

// File: fibogen.scala
object FiboGen extends App {
  def fibFrom(a: Int, b: Int): Stream[Int] =
    a #:: fibFrom(b, a + b)

  println(fibFrom(1, 1).take(19).toList)
}
FiboGen.main(args)

And the output:

> scala fibogen.scala
List(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181)

The #:: operator seems crazy, but it’s just the “append” operator from a Stream. We are just saying what the first element is, and how to calculate the next element in the sequence. This looks pretty fucking awesome if you ask me.

Conclusion

We have shown how Scala can easily reproduce the lazy evaluation behavior from Python generators, even though that may not be what you get if you write your Scala code “straight away”, with the cleanest possible syntax. It seems Scala creators decided to give more priority to a behavior that goes writing everything you can in memory. If you need the lazy evaluation behavior you have to be careful about weather you are indeed working with a Stream. Just make sure you have a Stream, then, and you are safe.

So in Python you will mostly get generators, unless you transform a generator expression into a list comprehension, or you explicitly turn a generator into a list… In Scala you may not be sure of what you are doing, and you may accidentally waste your memory. I hope I will understand better how everything works with practice. But to be fair: sometimes in Python you are also not sure if an object is a list or a generator, or even iterable at all, and that is precisely the kind of situation that drives me mad in Python, and that I hope I can avoid by using Scala.

Looking at these examples is very instructive about other aspects of the two languages. Now I look at the for-yield expression from Scala as something between a Python generator function and a generator expression. The generator function is quite general and flexible, and it lets you have multiple different yield statements in the function body. But that means you can easily write sloppy and kludgy code, and that simple stuff can look ugly, like that Fibonacci implementation. Generator expressions can have pretty complicated and powerful code in the ‘for’ section, but you are limited to a single expression where you produce the values.

On the Scala side, the for-yield expression also lets you do a lot of stuff in the ‘for’ clauses section. But I like very much the fact that you don’t need to go on indenting further and further if you have nested loops, and I am specially fond of the fact you can assign variables up there. That means you can filter over calculated values without needing to recalculate them, for instance, what is extremely useful. And then on the ‘yield’ block you have a function that is applied like a map over the input. So it’s restricted like a generator expression, but you can write more complex code there, without having to create a Python lambda, for instance.

Scala is equipped from the start with lots of useful list-processing things that Python does not have. For instance, Scala iterables support methods such as flatMap and foldLeft, and also the take method that we used in the Fibonacci example. In Python you often need to import stuff from collections or itertools like we did to use the islice function that made our code look very ugly. I also love it how a one-liner ‘for-print’ loop looks in Scala! You can do it in Python too but it doesn’t look natural…

All that means Scala is truly a more “functional” language than Python… But at the same time it lets you do stuff like sticking print statements in the middle of the code to see what is happening, something that may not be so easy to do in more traditional functional languages.

And I finish the article just saying that the program did finish running successfully while I was writing this! 🙂

Advertisements

9 thoughts on “Scala streams and Python generators

  1. ikaruga2099

    Awesome post. I was wondering the same thing about about Scala for-loops.
    The one thing that I can’t seem to figure out how to do is the old-fashioned for loop [ex: for(int i=0; i<…; i++)]
    For straight math calculations, this is the fastest and least memory-intensive way to go.
    A traditional Java app will beat out Scala streams in that case.

    Reply
    1. nlw0 Post author

      Thanks. Well, a for in Scala will have to take values from an iterable, usually a range. If you need anything different you should use either a recursive function or a while loop. But the better still is to try to use something like map or reduce actually.

      Comparing the performance of Scala and Java (and C) can be surprising sometimes. Scala has a known issue with the performance of its for loops ( https://issues.scala-lang.org/browse/SI-1338 ), but there are solutions out there, and this should cease to be a problem in new versions. Scala may also have problems related to the boxing of native numeric types, etc… If you really need the best performance for some big numeric processing you will always have to go lower-level. But do always try a high-level version first to see how bad the result is!

      Reply
  2. ikaruga2099

    I should also point out that I suspect that not all generator functionality can be easily replicated with streams.
    Javascript also has generators (via the ES6 standard) and streams (via libraries like Kefirjs or Baconjs).
    I’ve been replicating ReactJs’ Redux library using streams (namely, functional reactive programming). Everything has been going smoothly until I stumbled on redux-saga—a side effects library using generators.
    For the life of me I can’t seem to figure out how to replicate the same intuitive API using streams.
    We’ll see how it goes…

    Reply
    1. nlw0 Post author

      Hmm I had never heard of that, seems interesting. For reactive/async programming I have only been working with Akka, have you tried it? I think it may be quite a different from that, but it’s very cool. And for UI there is Scala.js also, I only played a little and couldn’t tell if it might suit your needs, but it looks like a powerful library and connected to all these modern trends…

      Reply
  3. Mike

    Just want to pint out, that scala’s streams not quiet same as python’s generators, because streams do perform memoization of computed elements, caching them, another words. But generators AFAIK don’t cache anything. So, to be closer to your python’s code, I think u need Range.iterator not Range.toStream for scala.
    And many thanks for article – I moving vice versa, from scala to python and your experiments listed here helped a lot to build some understanding of generators.

    Reply
      1. nic Post author

        Yeah, definitely. There is nothing quite like a Stream in Python, and perhaps even in any other languages I know. In Python you get either a list (vector) or an iterator… For the purposes of this article here I might have just user iterator actually, but then I’m also not very sure we could make such a nice implementation of Fibo, comparable or better than the first Python one with a generator function. Exercise left to the reader! 🙂

        But anyway, after writing this I came to learn a lot more about Scala data structures, and learned how superior it is to Python and other languages. Hope I can help other people to learn that out more easily.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s