Priyanka Nawalramka

Go: Tee it Together - io.TeeReader()

2026-06-09T00:00:00+00:00

In my previous article, I talked about connecting io.Reader and io.Writer instances using io.Pipe(). Another handy construct in Go’s io package is io.TeeReader(). Yes, it is conceptually similar to the Unix tee command. This standard library function accepts an io.Reader (the original source) and an io.Writer (the intercepting destination), and returns a wrapper that implements the io.Reader interface. When Read(p []byte) is called on the wrapper, the bytes are copied into the given destination buffer and, before the call returns, also written to the intercepting writer. This is best explained with an example.

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "log"
    "strings"
)

func main() {
    const chunkSize = 8

    sourceVal := `Hello from this demo string.`
    srcRdr := strings.NewReader(sourceVal)

    hash := sha256.New()
    rdr := io.TeeReader(srcRdr, hash) // compute the digest of the bytes while reading it
    var totalBytesRead int
    buffer := make([]byte, chunkSize)
    for {
        bytesRead, err := rdr.Read(buffer)
        if err != nil && err != io.EOF {
            log.Fatalf("error reading from source: %s\n", err)
        }
        if bytesRead == 0 {
            break
        }

        processChunk(buffer[:bytesRead])
        totalBytesRead += bytesRead
    }

    fmt.Printf("total bytes read: %d\n", totalBytesRead)
    fmt.Printf("Hashsum: %x\n", hash.Sum(nil))
}

func processChunk(b []byte) {
    // simulate processing data in chunks
    fmt.Printf("processing chunk: %s\n", string(b))
}

Points to note

The underlying read implementation does not explicitly buffer the bytes read.
The read call blocks until all bytes are written to the intercepting destination. This may have a throttling effect on the original read logic if the writes are slow.

Under the hood

The Read() implementation of the internal teeReader wrapper struct calls Read() on the original source to populate the given buffer. The bytes that were successfully read are then written to the intercepting destination before the call returns. Any write error is propagated up to the caller of Read(). If you’re asking: Priyanka, could I do this myself? My answer: Yes, absolutely. teeReader is just a small convenience provided by the standard library, not special language support. In fact, the source code is under 20 lines of code.
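To make the "could I do this myself?" answer concrete, here is a minimal sketch of a hand-rolled tee reader. The names myTeeReader and hashWhileReading are hypothetical and exist only for this illustration; this is my approximation of the behavior described above, not a copy of the standard library source.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"strings"
)

// myTeeReader is a hypothetical hand-rolled stand-in for the wrapper
// returned by io.TeeReader.
type myTeeReader struct {
	src io.Reader // original source
	dst io.Writer // intercepting destination
}

// Read fills p from the source, then mirrors the bytes that were read
// to the destination before returning to the caller.
func (t *myTeeReader) Read(p []byte) (int, error) {
	n, err := t.src.Read(p)
	if n > 0 {
		// A failed write is reported to the caller of Read.
		if n, werr := t.dst.Write(p[:n]); werr != nil {
			return n, werr
		}
	}
	return n, err
}

// hashWhileReading drains s through the hand-rolled tee and returns the
// bytes that were read along with their hex-encoded SHA-256 digest.
func hashWhileReading(s string) (string, string, error) {
	h := sha256.New()
	data, err := io.ReadAll(&myTeeReader{src: strings.NewReader(s), dst: h})
	if err != nil {
		return "", "", err
	}
	return string(data), fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	data, sum, err := hashWhileReading("Hello from this demo string.")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes, digest %s\n", len(data), sum)
}
```

In real code I would still reach for io.TeeReader(); the sketch is only meant to show that there is no magic involved.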

Go: Streaming Data via io.Pipe()

2026-06-04T00:00:00+00:00

In your production app, you might have come across a scenario where you have a large number of records sitting in a database and you need to send them over a network without blowing up the application's memory. A typical example is wanting to extract all of that data and dump it into S3, or send it to an HTTP endpoint. The Go standard library has a simple, yet powerful tool that can be leveraged for this.

io.Pipe() creates a unidirectional, synchronous channel for data flowing from a sender to a receiver, without the additional overhead of a temporary in-memory buffer. If you have used Unix pipes before, the idea is exactly the same. Here is how it works.

Send data to stdout

In this simple demonstration, a stream of records is written to the pipe within a goroutine. Another goroutine (the main goroutine) reads the records from the stream as they become available and passes them to stdout for display.

package main

import (
	"encoding/json"
	"io"
	"log"
	"os"
	"time"
)

func main() {
	rdr, wrtr := io.Pipe()

	go func() {
		defer wrtr.Close() // writer must be closed to signal end of stream to reader

		data := `["list record 1", "list record 2"]`
		if writeE := json.NewEncoder(wrtr).Encode(data); writeE != nil {
			log.Fatalf("error writing to pipe: %s\n", writeE)
		}

		time.Sleep(1 * time.Second)

		data = `["list record 3", "list record 4"]`
		if writeE := json.NewEncoder(wrtr).Encode(data); writeE != nil {
			log.Fatalf("error writing to pipe: %s\n", writeE)
		}
	}()

	defer rdr.Close()
	if _, err := io.Copy(os.Stdout, rdr); err != nil {
		log.Fatal(err)
	}
}

Send data over HTTP

Similarly, you can send data over an HTTP endpoint in chunks without explicit buffering.

package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	rdr, wrtr := io.Pipe()

	go func() {
		defer wrtr.Close() // writer must be closed to signal end of stream to reader

		log.Println("sending batch 1")
		data := `["list record 1", "list record 2"]`
		if writeE := json.NewEncoder(wrtr).Encode(data); writeE != nil {
			log.Fatalf("error writing to pipe: %s\n", writeE)
		}

		time.Sleep(1 * time.Second)

		log.Println("sending batch 2")
		data = `["list record 3", "list record 4"]`
		if writeE := json.NewEncoder(wrtr).Encode(data); writeE != nil {
			log.Fatalf("error writing to pipe: %s\n", writeE)
		}
	}()

	defer rdr.Close()
	resp, err := http.Post("https://httpbin.org/anything", "application/json", rdr)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // display the response
}

This approach can be used in orchestrating a memory efficient workflow to pull data in chunks from one source, and sending it over the network without allocating unnecessary in-memory buffers. An example would be pulling a large dataset from the database and streaming it directly to S3 via a multipart upload.
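As a rough sketch of that workflow, the program below pulls records page by page and streams them through a pipe as newline-delimited JSON. Everything here is an assumption made for illustration: the record type, fetchPage() (a stand-in for a paginated database query), and countLines() (a stand-in for the network upload) are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"log"
)

// record is a hypothetical row type.
type record struct {
	ID int `json:"id"`
}

// fetchPage is a hypothetical stand-in for a paginated database query.
// It returns up to size records starting at offset, and an empty slice
// once all total records have been returned.
func fetchPage(offset, size, total int) []record {
	var page []record
	for id := offset; id < offset+size && id < total; id++ {
		page = append(page, record{ID: id})
	}
	return page
}

// streamRecords returns a reader that yields the records as
// newline-delimited JSON. Only one page is held in memory at a time.
func streamRecords(total, pageSize int) io.Reader {
	rdr, wrtr := io.Pipe()
	go func() {
		enc := json.NewEncoder(wrtr)
		for offset := 0; ; offset += pageSize {
			page := fetchPage(offset, pageSize, total)
			if len(page) == 0 {
				wrtr.Close() // signal end of stream to the reader
				return
			}
			for _, rec := range page {
				if err := enc.Encode(rec); err != nil {
					wrtr.CloseWithError(err)
					return
				}
			}
		}
	}()
	return rdr
}

// countLines consumes the stream; it stands in for the network upload.
func countLines(r io.Reader) (int, error) {
	sc := bufio.NewScanner(r)
	n := 0
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

func main() {
	n, err := countLines(streamRecords(25, 10))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("streamed %d records\n", n)
}
```

The consumer here reads until the end of the stream. A consumer that stops early should close the read end so that the writing goroutine is not left blocked.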

Points to note

The reader blocks until the writer sends bytes to the pipe or the write end of the pipe is closed.
A write call blocks until the reader consumes all the bytes written to the pipe or the read end is closed.
Due to the synchronous blocking nature, read and write operations must be performed in separate goroutines to prevent deadlocks.
io.Pipe() is probably overkill for simple use cases with smaller payloads, which are better off using an in-memory buffer.
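One related detail: the writing goroutine can hand a failure to the reader by closing the write end with CloseWithError(), instead of exiting the whole process. The sketch below assumes a hypothetical error value errSource and hypothetical helpers produce() and consume(); it only illustrates how the close is observed on the read end.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errSource is a hypothetical failure raised by the producing side.
var errSource = errors.New("source failed")

// produce writes one batch to the pipe. If fail is true, it then closes
// the write end with errSource; otherwise it closes it normally.
func produce(fail bool) io.Reader {
	rdr, wrtr := io.Pipe()
	go func() {
		if _, err := io.WriteString(wrtr, "batch 1\n"); err != nil {
			wrtr.CloseWithError(err)
			return
		}
		if fail {
			wrtr.CloseWithError(errSource) // the reader sees this error
			return
		}
		wrtr.Close() // the reader sees io.EOF
	}()
	return rdr
}

// consume reads until the write end is closed and returns whatever
// arrived, along with any error the writer closed the pipe with.
func consume(r io.Reader) (string, error) {
	data, err := io.ReadAll(r)
	return string(data), err
}

func main() {
	out, err := consume(produce(false))
	fmt.Printf("clean close: %q, err=%v\n", out, err)

	out, err = consume(produce(true))
	fmt.Printf("failed close: %q, err=%v\n", out, err)
}
```

In both cases the reader receives the batch that was written before the close; only the final error differs.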

Under the hood

The current version of Go implements the in-memory pipe using channels, both for synchronization and for passing the data between the read and write ends.

Creating a Loop Device in Linux

2023-11-05T00:00:00+00:00

Overview

If you have ever downloaded a new Linux distribution ISO image, you may have wondered how to view the content of the image prior to repartitioning your disk and installing the operating system onto your local disk. This can be done via a loop mount in Linux.

In Linux and other UNIX-like systems, it is possible to use a regular file as a block device. A loop device is a virtual or pseudo-device which enables a regular file to be accessed as a block device. Say you want to create a Linux file system but do not have a free disk partition available. In such a case, you can create a regular file on the disk and create a loop device using this file. The device node listing for the new pseudo-device can be seen under /dev. This loop device can then be used to create a new file system. The file system can be mounted, and its content can be accessed using normal file system APIs.

Uses of Loop Device

As described above, one of the uses is creating a file system with a regular file when no disk partition is available.

Another common use of a loop device is with ISO images of installable operating systems. The content of an ISO image can be easily browsed by mounting the ISO image as a loop device.

Creating a Loop Device in Linux

These commands require root privilege.

Create a large regular file on disk that will be used to create the loop device.
```
 # dd if=/dev/zero of=/loopfile bs=1024 count=51200
 51200+0 records in
 51200+0 records out
 52428800 bytes (52 MB, 50 MiB) copied, 0.114882 s, 456 MB/s
```
This command creates a 50 MiB file called /loopfile, filled with zeros.

If you already have an image file that you want to mount as a loop device, then you can skip this step.

Create a loop device with the large file created above.

There may be some loop devices already created. Run the following command to find the first available device node.
```
 # losetup -f
 /dev/loop1
```
So we can safely use /dev/loop1 to create our loop device. Create the loop device with the following command.
```
 # losetup /dev/loop1 /loopfile
```
If you see no errors, the regular file /loopfile is now associated with the loop device /dev/loop1.

Confirm creation of the loop device.
```
 # losetup /dev/loop1
 /dev/loop1: [66309]:214 (/loopfile)
```

Creating a Linux Filesystem With the Loop Device

You can now create a normal Linux filesystem with this loop device.

Create an ext4 filesystem using /dev/loop1.

```
 # mkfs -t ext4 -v /dev/loop1
 mke2fs 1.45.3 (14-Jul-2019)
 fs_types for mke2fs.conf resolution: 'ext4', 'small'
 Discarding device blocks: done
 Filesystem label=
 OS type: Linux
 Block size=4096 (log=2)
 Fragment size=4096 (log=2)
 Stride=0 blocks, Stripe width=0 blocks
 12800 inodes, 12800 blocks
 640 blocks (5.00%) reserved for the super user
 First data block=0
 Maximum filesystem blocks=14680064
 1 block group
 32768 blocks per group, 32768 fragments per group
 12800 inodes per group
 Allocating group tables: done
 Writing inode tables: done
 Creating journal (1024 blocks): done
 Writing superblocks and filesystem accounting information: done
```

Create a mount point for the filesystem.
```
 # mkdir /mnt/loopfs
```
Mount the newly created filesystem.
```
 # mount -t ext4 /dev/loop1 /mnt/loopfs
```
This command mounts the loop device as a normal Linux ext4 filesystem, on which normal filesystem operations can be performed.

Check disk usage of the file system.

```
 # df -h /dev/loop1
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/loop1       45M   48K   41M   1% /mnt/loopfs
```

Use tune2fs to see the filesystem settings.

```
 # tune2fs -l /dev/loop1
 tune2fs 1.45.3 (14-Jul-2019)
 Filesystem volume name:
 Last mounted on:
 Filesystem UUID:          b1b13d6e-c544-45dd-a549-5846371fbde6
 Filesystem magic number:  0xEF53
 Filesystem revision #:    1 (dynamic)
 Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
 Filesystem flags:         signed_directory_hash
 Default mount options:    user_xattr acl
 Filesystem state:         clean
 Errors behavior:          Continue
 Filesystem OS type:       Linux
 Inode count:              12800
 Block count:              12800
 Reserved block count:     640
 Free blocks:              11360
 Free inodes:              12789
 First block:              0
 Block size:               4096
 Fragment size:            4096
 Group descriptor size:    64
 Reserved GDT blocks:      6
 Blocks per group:         32768
 Fragments per group:      32768
 Inodes per group:         12800
 Inode blocks per group:   400
 Flex block group size:    16
 Filesystem created:       Sun Mar 19 08:56:47 2023
 Last mount time:          Sun Mar 19 09:00:52 2023
 Last write time:          Sun Mar 19 09:00:52 2023
 Mount count:              1
 Maximum mount count:      -1
 Last checked:             Sun Mar 19 08:56:47 2023
 Check interval:           0 ()
 Lifetime writes:          37 kB
 Reserved blocks uid:      0 (user root)
 Reserved blocks gid:      0 (group root)
 First inode:              11
 Inode size:              128
 Journal inode:            8
 Default directory hash:   half_md4
 Directory Hash Seed:      e489fd33-4003-4235-9347-144c7a5d4d73
 Journal backup:           inode blocks
 Checksum type:            crc32c
 Checksum:                 0x3b8c797a
```

To unmount the filesystem and delete the loop device, run the following commands.
```
 # umount /mnt/loopfs/
 # losetup -d /dev/loop1
```

This article was initially published here.

Memory Profiling in Python with tracemalloc

2023-11-04T00:00:00+00:00

Memory profiling is useful when looking at how much memory an application is using. The most important use case, however, is when you suspect that the application is leaking memory, and it is important to trace where that leak occurs in the code.

A memory leak happens when a block of memory allocated by an application is not released back to the operating system, even after the object is out of scope and there are no remaining references to it. When this happens, memory utilization keeps increasing until an OOM (out-of-memory) error occurs and the operating system kills the application. If the utilization metric is plotted on a graph, it will display a constantly growing trend until the application dies. If the application restarts itself after the OOM, the graph will show a sawtooth pattern: a repeated climb followed by a sudden drop at each OOM. In programming languages where the compiler/interpreter doesn’t manage allocations, a leak can occur if the developer forgets to free allocated memory. It is, however, not uncommon to see memory leaks in memory-managed languages as well.

Python manages memory allocations itself. Python’s memory manager is responsible for all allocations and deallocations on the private heap. By default, Python uses reference counting to keep track of freeable objects. The optional garbage collector provides supplemental mechanisms to detect unreachable objects, otherwise not reclaimed due to cyclic referencing.

Several tools are available for profiling memory in Python. This article covers “tracemalloc” which is part of the standard library. Getting started is very easy because no special installation is needed. One can start profiling and viewing the statistics straight out of the box. Tracemalloc provides statistics about objects allocated by Python on the heap. It does not account for any allocations done by underlying C libraries.

The profiler works by letting you take a snapshot of memory during runtime. There are several convenient APIs to then display the statistics contained in that snapshot, grouped by filename or lineno. It is also possible to compare two snapshots, which helps identify how memory use changes over the course of program execution. Finally, there are methods to display the traceback to where a block of memory is allocated in the code.

I will go through an example that simulates constantly growing memory (similar to a leak) and show how to use the tracemalloc module to display statistics and eventually trace the line of code introducing that leak.

Tracing a memory leak

Here is a one-line function called mem_leaker() that will be used to simulate the memory leak. It grows a global array by ten thousand elements every time it is invoked.

arr = []
def mem_leaker():
    '''Appends to a global array in order to simulate a memory leak.'''
    arr.append(np.ones(10000, dtype=np.int64))

I wrapped this function inside a script named memleak.py with some driver code. The first thing to do is to call tracemalloc.start() as early as possible in order to start profiling. The default frame count is 1. This value defines the depth of the traceback Python will capture for each allocation. The value can also be set through the PYTHONTRACEMALLOC environment variable. In this example, I am passing a value of 10 to tracemalloc.start() to set the count to ten at runtime.

The tiny for loop iterates five times, invoking the makeshift memory leaker each time. The call to gc.collect() just nudges Python’s garbage collector to release any unreachable memory blocks, to filter out noise. That said, you can be sure this program does not create any cyclic references.

memleak.py

import gc
import tracemalloc

import numpy as np

import profiler

arr = []
def mem_leaker():
    '''Appends to a global array in order to simulate a memory leak.'''
    arr.append(np.ones(10000, dtype=np.int64))


if __name__ == '__main__':
    tracemalloc.start(10)

    for _ in range(5):
        mem_leaker()
        gc.collect()
        profiler.snapshot()

    profiler.display_stats()
    profiler.compare()
    profiler.print_trace()

All the profiling code is written inside a separate script called profiler.py.

profiler.py

import tracemalloc

# list to store memory snapshots
snaps = []

def snapshot():
    snaps.append(tracemalloc.take_snapshot())


def display_stats():
    stats = snaps[0].statistics('filename')
    print("\n*** top 5 stats grouped by filename ***")
    for s in stats[:5]:
        print(s)


def compare():
    first = snaps[0]
    for snapshot in snaps[1:]:
        stats = snapshot.compare_to(first, 'lineno')
        print("\n*** top 10 stats ***")
        for s in stats[:10]:
            print(s)


def print_trace():
    # pick the last saved snapshot, filter noise
    snapshot = snaps[-1].filter_traces((
        tracemalloc.Filter(False, ""),
        tracemalloc.Filter(False, ""),
        tracemalloc.Filter(False, ""),
    ))
    largest = snapshot.statistics("traceback")[0]

    print(f"\n*** Trace for largest memory block - ({largest.count} blocks, {largest.size/1024} Kb) ***")
    for l in largest.traceback.format():
        print(l)

Going back to the for loop in the driver code, after mem_leaker() is invoked, profiler.snapshot(), which is just a wrapper around tracemalloc.take_snapshot(), will take a snapshot and store it in a list. The length of the list will be five at the end of the loop.

Once you have the snapshots, you can see how much memory was allocated. It can be useful to begin with per file grouping if the application is big, and you have no idea where the leak is happening. For demonstration, take a closer look at profiler.display_stats(); it displays the first five grouped items from the first snapshot.

def display_stats():
    stats = snaps[0].statistics('filename')
    print("\n*** top 5 stats grouped by filename ***")
    for s in stats[:5]:
        print(s)

The output looks like this.

At the top of the list, numpy’s numeric.py allocated 78.2 Kb of memory. The tracemalloc module itself will also use some memory for profiling. That should be ignored during observation.

Comparing snapshots and observing trends

Now, in order to debug further, the code compares the first snapshot with each subsequent snapshot and shows the top ten differences, using compare().

def compare():
    first = snaps[0]
    for snapshot in snaps[1:]:
        stats = snapshot.compare_to(first, 'lineno')
        print("\n*** top 10 stats ***")
        for s in stats[:10]:
            print(s)

The complete output of compare() looks like this.

Since there were five total snapshots, there are four sets of statistics from the comparisons. Some important observations here:

numpy/core/numeric.py at line 204 is at the top in each set, allocating the most memory.
Now, note the change in memory usage by numeric.py as I go over each of the result sets.
- In the first set, the total memory used by numeric.py is 156 Kb, an increase of 78 Kb from the first snapshot.
- In the second set, total memory used by numeric.py jumps to 235 Kb, a total increase of 156 Kb from the first snapshot. Doing a little math, this also means there was an increase of ~79 Kb since the last snapshot.
- In the third set, total memory used by numeric.py jumps to 313 Kb, a total increase of 235 Kb from the first snapshot. This also means there was again an increase of ~78 Kb since the last snapshot.
- In the fourth set, total memory used by numeric.py jumps to 391 Kb, a total increase of 313 Kb from the first snapshot. This also means there was, once again, an increase of ~78 Kb since the last snapshot.

A clear cumulative trend can be seen here in the allocation done by numeric.py. At each iteration, ignoring rounding differences, ~78 Kb more memory was allocated. If you look at mem_leaker() again, it appends ten thousand elements to the array, each of size 8 bytes, which brings the total to 80000 bytes, or 78.125 Kb. This matches the observed increase from the snapshot deltas.

When the growth pattern is unexpected, constantly growing trends like this are what eventually lead to a problem in the running code.

See a traceback

The last thing in the example was to print the traceback for the largest memory block for more granularity. In this case, the largest block is one from numeric.py. The filter_traces() method is very helpful in eliminating noise when debugging long traces.

def print_trace():
    # pick the last saved snapshot, filter noise
    snapshot = snaps[-1].filter_traces((
        tracemalloc.Filter(False, ""),
        tracemalloc.Filter(False, ""),
        tracemalloc.Filter(False, ""),
    ))
    largest = snapshot.statistics("traceback")[0]

    print(f"\n*** Trace for largest memory block - ({largest.count} blocks, {largest.size/1024} Kb) ***")
    for l in largest.traceback.format():
        print(l)

The trace looks like this.

This trace leads from the application code to the line in the numpy library, where the allocation actually takes place. You can look at the source code here.

Conclusion

Here, you saw a simple demonstration of using the tracemalloc module from Python’s standard library to observe various memory-related statistics, compare snapshots to see the allocation deltas, and use all this information to trace back to the code using a substantial amount of memory. In large applications, it may take some time to narrow down the scope and find the line of code introducing a leak. The tracemalloc module provides all the necessary APIs to do so, and it just works out of the box.

References

https://docs.python.org/3/library/tracemalloc.html

This article was initially published here.

Batch Processing in Go

2020-11-25T00:00:00+00:00

Batching is a common scenario developers come across. Basically, it means splitting a large amount of work into smaller chunks for optimal processing.

Seems pretty simple, and it really is. Say, we have a long list of items we want to process in some way. A pre-defined number of them can be processed concurrently. I can see two different ways to do it in Go.

First, using plain old slices. This is something most developers have probably done at some point in their career. Let’s take this simple example:

package main

import (
	"fmt"
	"sync"
)

func main() {
	data := make([]int, 0, 100)
	for n := 0; n < 100; n++ {
		data = append(data, n)
	}
	process(data)
}

func processBatch(list []int) {
	var wg sync.WaitGroup
	for _, i := range list {
		x := i
		wg.Add(1)
		go func() {
			defer wg.Done()
			// do more complex things here
			fmt.Println(x)
		}()
	}
	wg.Wait()
}

const batchSize = 10

func process(data []int) {
	for start, end := 0, 0; start <= len(data)-1; start = end {
		end = start + batchSize
		if end > len(data) {
			end = len(data)
		}
		batch := data[start:end]
		processBatch(batch)
	}
	fmt.Println("done processing all data")
}

The data to process is a plain list of integers. To keep things simple, we just want to print all of them, at most 10 concurrently. To achieve this, we loop over the list, divide it into chunks of batchSize = 10 and process each batch serially. Short and sweet, and does what we want.
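If the slicing logic is needed in more than one place, it can be pulled out into a small helper. The sketch below is one possible way to do that; chunk() is a hypothetical name I am introducing here, not something from the original snippet.

```go
package main

import "fmt"

// chunk is a hypothetical helper that splits data into consecutive
// sub-slices of at most size elements. The sub-slices share the
// backing array of data; nothing is copied.
func chunk(data []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var batches [][]int
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		batches = append(batches, data[start:end])
	}
	return batches
}

func main() {
	data := []int{0, 1, 2, 3, 4, 5, 6}
	for i, b := range chunk(data, 3) {
		fmt.Printf("batch %d: %v\n", i, b)
	}
	// prints:
	// batch 0: [0 1 2]
	// batch 1: [3 4 5]
	// batch 2: [6]
}
```

Each batch returned by chunk() could then be handed to processBatch() in a plain range loop.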

The second approach uses a buffered channel, similar to what’s described in this post on concurrency. Let’s look at the code first.

package main

import (
	"fmt"
	"sync"
)

func main() {
	data := make([]int, 0, 100)
	for n := 0; n < 100; n++ {
		data = append(data, n)
	}
	batch(data)
}

const batchSize = 10

func batch(data []int) {
	ch := make(chan struct{}, batchSize)
	var wg sync.WaitGroup
	for _, i := range data {
		wg.Add(1)
		ch <- struct{}{}
		x := i
		go func() {
			defer wg.Done()
			// do more complex things here
			fmt.Println(x)
			<-ch
		}()
	}
	wg.Wait()
	fmt.Println("done processing all data")
}

This example uses a buffered channel of size 10. Before each item is processed, the loop sends a value to the channel. Once 10 values are sitting in the buffer, further sends block. When a goroutine finishes processing its item, it receives from the channel, which frees up a slot in the buffer. Using struct{}{} saves us some space, because whatever is sent to the channel never gets used.
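To convince myself that the buffered channel really caps concurrency, a sketch like the one below can be used. It follows the same pattern but also tracks how many goroutines are in flight at once. The function name peakConcurrency() and the one-millisecond sleep standing in for real work are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency starts one goroutine per item, limited by a buffered
// channel of the given size, and returns the highest number of
// goroutines that were doing work at the same time.
func peakConcurrency(items, limit int) int64 {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var inFlight, peak int64

	for i := 0; i < items; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit goroutines hold a slot
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done

			n := atomic.AddInt64(&inFlight, 1)
			// Record n as the new peak if it is higher than the old one.
			for {
				old := atomic.LoadInt64(&peak)
				if n <= old || atomic.CompareAndSwapInt64(&peak, old, n) {
					break
				}
			}
			time.Sleep(time.Millisecond) // simulate work
			atomic.AddInt64(&inFlight, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Printf("peak concurrency: %d\n", peakConcurrency(100, 10))
}
```

The reported peak depends on scheduling, but it should never exceed the channel's capacity.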

As the author of the post points out, here we’re exploiting the properties of a buffered channel to limit concurrency. One might argue that this is not really batching, but rather concurrent processing with a threshold. And I would totally agree. Regardless, it gets the job done and the code is a tad simpler.

Is it any better than slices? Probably not. As for speed, I timed the execution of both programs, and they ran pretty close. These examples are far too simple to show any significant difference in runtime. Channels in general are slower and more expensive than slices. Since there is no meaningful data being passed between the goroutines, it’s probably a wasted effort. So why would I do it this way? Well, I like simple code. But that might not be enough of a reason. If the cost of processing each batch serially outweighs the cost of using a channel, it might be worth considering!

Note: Code snippets on this post are available here.

This content is also available here.