
The pmem suite of memory acquisition tools


The recent Rekall Furka release updated the pmem suite of acquisition tools, so I thought this would be a good time to write a blog post about the new features and to recap the work we have been doing on reliable and robust memory acquisition for Linux, OSX and Windows.

The Pmem memory acquisition suite

The Pmem suite of memory acquisition tools is already quite well known as the best open source memory acquisition option available and, in the case of OSX, the only reliable memory acquisition solution for the latest versions of the operating system. This blog post discusses some of the changes we made in the recently released pmem 2.1 series of acquisition tools. The old Winpmem 1.6.2 tool is still available and is perhaps the most stable and battle tested, but the new series offers many advantages, so it might be worth testing the new tools with your own incident response procedures.

While access to physical memory is different on each operating system, we tried to unify the userspace component as much as possible. This should make it easier to use because the same command line options are available on each operating system. Previous tools had completely different sets of options, but now most of the options are the same across all target OSs.

OSXPmem, WinPmem, LinPmem .... So many pmem's...

The Pmem suite offers a complete memory acquisition solution and consists of a few different sub-components. It is sometimes a bit confusing when we talk about so many different things all called pmem.  

Here is a high level overview of the different components:


We see that there are four main kernel based components to facilitate access to physical memory:
  1. On OSX, the MacPmem memory driver provides direct access to physical memory.
  2. On Windows, the WinPmem kernel driver provides physical memory access.
  3. On Linux, we mainly rely on the built-in /proc/kcore device, which is often enabled. This provides raw physical memory access natively.
  4. Sometimes, however, on Linux, the kcore device is disabled. In that case we also provide the pmem kernel driver which facilitates physical memory access.

All the kernel components enable access to physical memory, which we use to implement memory acquisition and live analysis. The memory acquisition tools, OSXPmem, WinPmem (2.1) and LinPmem, are operating-system-specific acquisition tools. They all use the same unified AFF4 imager framework and therefore have similar command line arguments and produce images in the same way.

Finally, Rekall can use all kernel components directly when performing live analysis on all supported operating systems. Rekall also has a plugin called aff4acquire which uses this raw physical memory access to acquire a physical memory image, in a similar way to the standalone tools. Memory acquisition through Rekall can include additional data which can only be deduced from live analysis (for example, it also captures all mapped files), but it requires running Rekall (which has a larger footprint and requires access to the profile repository).

What is AFF4 anyway,  and why are you forcing me to use it?

Because the pmem imagers use the unified AFF4 imager framework, the pmem acquisition tools always write AFF4 images. What does that mean?

I know that many people are still using RAW or ELF images, mainly in order to interact with other tools that do not support AFF4 directly. Maybe users are not ready to commit to a new file format which is potentially not compatible with other tools?

AFF4 (Advanced Forensic Format 4) is a concept, as well as a file format. The main features of AFF4 are that it defines containers, streams and metadata:

  1. A stream is an object which supports random seeking and reading. For example, a regular file on disk is a stream. AFF4 also defines other stream formats.
  2. Similarly, a container is just something that contains streams. For example, a regular filesystem directory is also an AFF4 container. By default pmem will use a ZIP file as a container but it can easily use a directory instead.
  3. Finally, AFF4 contains metadata about the image. This is something you always want for memory acquisition - the more data the better! AFF4 stores metadata in RDF Turtle format by default, but Pmem also uses YAML.

The nice thing about AFF4, therefore, is that it is completely compatible with simpler file formats such as RAW, while at the same time, thanks to the additional metadata included, tools that understand AFF4 can make use of that metadata automatically.

So when we say we only write AFF4 images, we really mean that the images that are produced are written in a structured way, but these images can also be made to look like a RAW or ELF image to other tools.

Let me see examples...

Let's take a look at an example. In this example, I am acquiring memory from an OSX system into a standard AFF4 ZIP container. This is the default output format:
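A command along the following lines produces this default output (a sketch only: the output file name is arbitrary and the tool is assumed to be run as root):

$ sudo osxpmem -o test.aff4 -t -c snappy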
In this example we specified:
  • -o test.aff4 : write output into this file.
  • -t : the output file will be truncated.
  • -c snappy : use Snappy compression. This is much faster than the default zlib compression and still compresses pretty well.

The resulting AFF4 file is stored in a ZipFile container. What does that look like?
We can use osxpmem to show us some metadata about the image:
  • The volume contains two streams:
    • The first stream (ends with dev/pmem) is a map with a category of physical memory. This is the actual memory image.
    • The second stream is an aff4:image which stores bulk image data, divided into compressed chunks.
An AFF4 map is an efficient construct which allows us to store sparse images (with holes), such as memory images, which usually have gaps for PCI DMA regions. We do not need to waste any space on the sparse gaps. The map itself is backed by a regular AFF4 image stream which uses compressed chunks to store the bulk data in the image. Hence we get both compression and a sparse image.

We can use the regular unzip command to inspect it, as shown below (NOTE: the default unzip that comes with OSX does not support ZIP64 very well, so it reports the file as slightly corrupted. It is not, in fact, corrupted at all, and a proper unzip utility handles it fine).
We can see some members in the zip file:
  • information.turtle contains RDF information about this AFF4 container.
  • dev/pmem/idx stores the map transformation (very small)
  • dev/pmem/data is the bulk data stream
  • dev/pmem/data/00000XXXXX are the segments which store the compressed chunks.
  • dev/pmem/data/00000XXXXX/index are indexes for each segment.
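For reference, any ZIP64-capable unzip can list these members; a minimal sketch, reusing the image from the earlier example:

$ unzip -l test.aff4
# lists information.turtle, dev/pmem/idx and the dev/pmem/data segments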

This image works well with Rekall which supports AFF4 natively:







But suppose I want to use another program with this image, so I really want RAW output.
We could always export the memory image from the AFF4 volume into a raw file:
But this might be inconvenient. What if we wanted to acquire to RAW format in the first place? Pmem supports several formats for the memory stream itself - the default AFF4 map, as well as RAW and ELF.
It is important to note that the --format option refers to the format of the memory stream (i.e. RAW), not the format of the container, which is still an AFF4 ZIP based file. In the following example, I will create a zip file with a huge RAW memory image in it:
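A sketch of such an invocation, reusing the flags from the earlier example (the exact value accepted by --format is an assumption here):

$ sudo osxpmem --format raw -o raw_test.aff4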








This is fine and it is very useful when you want to acquire a bunch of files off the same system (e.g. memory, pagefiles, drivers etc). In the end you just get a simple (but very large) ZIP file with all the files inside it - convenient for easy transport off the system. Note that even though the RAW image is stored in a zip file it is not compressed, so the zip file is still huge. This is so that Rekall can use the zip file directly without needing to unpack it first (you cannot seek inside a compressed archive member). The RAW image is just a single large uncompressed archive member.

This is still not that useful for running other tools on the image - we will need to unzip the image first so we can point our tool at it. Remember that I mentioned above that a simple filesystem directory is also an AFF4 container? Why don't we write the image into a directory?

In order to make pmem choose an AFF4 directory container you just specify a directory path. So you need to create a directory first:
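A sketch of the two steps (the directory name is arbitrary; --format raw matches the example that follows):

$ mkdir /tmp/memory_image
$ sudo osxpmem --format raw -o /tmp/memory_image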









Remember that AFF4 containers just store streams so it is exactly the same as the previous example, except that the files are written to a directory instead of a single Zip file. However, in this case, you will find the huge RAW physical memory file (dev%2fpmem) inside the directory - ready to be viewed with another tool:
Note also the metadata files which contain important information about the memory collected:
Rekall, however, understands AFF4 volumes, so it can use the metadata directly to bootstrap analysis (e.g. use the correct profile automatically). Note that when we use Rekall we should specify the path to the entire directory, not the path to the RAW image inside the directory (so Rekall knows we mean the AFF4 volume):

Adding files

We often only get one chance to acquire memory. We need to make sure that everything we might need in the analysis phase will end up in the image. For example, sometimes after analysis we see a mapped file that is of interest in the memory image. But trying to dump files from memory is not very reliable since many pages are missing - it would have been better to acquire these files in the first place.

The nice thing about AFF4 containers is that they can store multiple streams - i.e. they can capture more than one file. The Pmem suite tries to acquire as many files as it can automatically. For example, on Windows the pagefile can be acquired, as well as all the drivers and the kernel image itself. On Linux, the contents of /proc/kallsyms is captured, as well as the /boot/ partition.

It is possible to include other files during acquisition. This can be done during the memory acquisition or later, by appending to the AFF4 volume:
Note that by default, when you specify the -i flag, pmem assumes you want to add files to an existing volume and does not acquire memory. You can force memory acquisition and file inclusion at the same time with the -m flag.
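A hedged sketch of both modes (the file path here is arbitrary, and the exact argument handling of -i may differ):

# Append an extra file to an existing volume (no memory acquisition):
$ sudo osxpmem -i /var/log/system.log -o test.aff4
# Acquire memory and include the file in the same run:
$ sudo osxpmem -m -i /var/log/system.log -o test.aff4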

The Winpmem acquisition tool

The driver component of Winpmem has been stable for many years now. Like the MacPmem driver, it uses advanced page table manipulation to bypass any OS restrictions (we published these techniques previously). So this release of winpmem reuses the same driver as Winpmem 1.6.2 - the only difference is in the userspace tool.

The OSXPmem acquisition tool

The OSX counterpart of the pmem suite comes as a zip file. To use it you simply need to unzip it as root (note: you must be root when unzipping in order to ensure proper file permissions. The extracted files must be owned by root and only readable by root).
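As a sketch (the archive name and the extracted layout are assumptions; use the paths from the release you actually downloaded):

$ sudo unzip osxpmem_2.1.zip
$ sudo osxpmem.app/osxpmem -o /tmp/mem.aff4 -c snappy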




Rekall and the Windows PFN database

Rekall has long had the capability of scanning memory for various signatures. Sometimes we scan memory to try and recover pool tags (e.g. psscan), other times we might scan for specific indicators or Yara rules (e.g. yarascan). In the latest version of Rekall, we have dramatically improved the speed and effectiveness of these capabilities. This post explains one aspect of the new Rekall features and how it was implemented and can be used in practice to improve your forensic memory analysis.

Traditionally one can scan the physical address space, or a virtual address space (e.g. the kernel's address space or the address space of a process). There are tradeoffs with each approach. For example, scanning physical memory is very fast because IO is optimized: large buffers are read contiguously from the image and the signatures are applied to entire buffers. However, traditionally this kind of scanning could only report that a signature was found somewhere, without any context for the hit, which makes it difficult to determine which process owned the memory and what the process was doing with it.

If we scan the virtual address space we see memory in the same way that a process sees it. This is ideal because if a signature is found, we can immediately determine which process address space it appears in, and precisely where, in that address space, the signature resides.

Unfortunately, scanning a virtual address space is more time consuming because reading from the image (or live memory) is non-sequential. Rekall effectively has to glue together, in the right order, a bunch of page-sized buffers collected randomly from the image into a temporary buffer which can be scanned - this involves a lot of buffer copying and memory allocation. Additionally, when scanning the address spaces of processes, we will invariably scan the same memory multiple times because mapped files (like DLLs) are shared between many processes, and so they appear in multiple processes' virtual address spaces.

It would be awesome if we could ask: given a physical address (i.e. offset in the memory image), which process owns this page and what is the virtual address of this page in the process address space? Being able to answer this question quickly allows us to scan physical memory in the most efficient way (at least for smallish signatures which do not span page boundaries).

Rekall has a plugin called pas2vas which aims to solve this problem. It is a brute force plugin: simply enumerate all the virtual address to physical address mappings and build a large reverse mapping. This works well enough, but takes a while to construct the reverse mapping and because of this does not work well on live memory which is continuously changing.

Have you ever used the RamMap.exe tool from sysinternals? It’s an awesome tool which lets one see what each physical page is doing on your system. Here is an example screenshot (click on the image to zoom in):

This looks exactly like what we want! How does this magic work? Through understanding and parsing the Windows PFN database (Windows Internals) it is possible to relate a physical address directly to the virtual address in the process which owns it very quickly and efficiently. If only we had this capability in Rekall we could provide sufficient context to scanners in physical memory to work reliably and quickly!

Let's explore how one can use the Windows PFN database to ask exactly what each physical page is doing. In the end we implemented new plugins such as "pfn", "ptov" and "rammap" to shed more light on how physical memory is used within the Windows operating system. These plugins are integrated with other plugins (e.g. yarascan) to provide more contextual information for physical addresses.

Windows page translation.

We all know about the AMD64 page tables and how they work, so I won't go into too much depth here. Suffice it to say that the hardware needs page tables written in memory, which control how address resolution works. The CR3 register contains the Directory Table Base (DTB), which is the address of the top-most table.

The hardware then traverses these tables, masking bits off the virtual address, until it reaches the PTE, which contains information about the physical page backing that virtual address. The PTE may be in a number of states (which we described in detail in previous posts and which are also covered in the Windows Internals book). In any state other than the "Valid" state, the hardware will generate a page fault, and the operating system must then locate the actual physical page.

Rekall has the "vtop" plugin (short for virtual-to-physical) to help visualize what page translation is doing. Let us take for example a Windows 7 image:
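In the Rekall console the invocation looks roughly like this (the image name and prompt are illustrative; vtop accepts a symbol name or an address):

[1] win7.elf 10:15:32> vtop "nt"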

Here we ask Rekall to translate the symbol for "nt" (the kernel's image base) in the kernel address space into its physical address. Rekall prints the address of each entry in the 4-level page table and finally prints the PTE content in detail. We can see that the PTE for this symbol is a HARDWARE PTE which is valid (i.e. the page exists in physical memory), and the relevant page frame number (i.e. the physical address) is shown.

The important points to remember about page translation are:
  • The page tables are primarily meant for the hardware. Addresses for valid pages in the page tables are specified as Page Frame Numbers (PFN) which means they are specified in the physical address space. This is the only thing the MMU can directly access.
  • On the other hand, code running on the CPU cannot dereference physical addresses directly. All access by the kernel occurs through its virtual address space.
    • Invalid PTEs may carry any data to be used by the kernel, therefore typically those addresses are specified in the kernel's virtual address space.
  • All PTEs (and all parts of the page tables) must be directly mapped into the kernel's address space at all times so that the kernel may manipulate them.

Each PTE controls access to exactly one virtual memory page. The PTE is the basic unit of control for virtual pages, and each page in any virtual address space must have at least one PTE controlling it (regardless of whether the page is actually mapped into physical memory or not - the PTE will indicate where the page can be found).

There are two types of PTEs - hardware PTEs and prototype PTEs. Prototype PTEs are used by the kernel to keep track of a page's intended purpose, and although they have a similar format to hardware PTEs, they are never accessed directly by the MMU. Prototype PTEs are allocated from pool memory, while hardware PTEs are allocated from the System PTE address range.

Section objects

Consider a mapped file in memory. Since the file is mapped into some virtual address space, some pages are copied from disk into physical memory, and therefore there are some valid PTEs for that address space. At the same time, other pages are not read from disk yet, and therefore must have a PTE which points at a prototype PTE.


The _SUBSECTION object is a management structure which keeps track of the mapped pages of a range within a mapped file. It holds an array of prototype PTEs - management structures similar in format to real PTEs.
Consider the figure above - a file is mapped into memory and Windows creates a _SUBSECTION object to manage it. The subsection has a pointer to the CONTROL_AREA (which in turn points to the FileObject which it came from) and pointers to the Prototype PTE array which represents the mapped region in the file. In this case a process is reading the mapped area and so the hardware PTE inside the process is actually pointing at a memory resident page. The prototype PTE is also pointing at this page.

Now imagine the process gets trimmed - in this case the hardware PTE will be made invalid and point at the prototype PTE. If the process tries to access the page again a page fault will occur and the pager will consult the prototype PTE to determine if the page is still resident. Since it is resident the hardware PTE will be just changed back to valid and continue to point at that page.

Note that in this situation, the physical page still contains valid file data, and it is still resident. It's just that the page is not directly mapped in any process. Note that it is perfectly OK to have another process with a valid hardware PTE mapping to the same page - this happens if the page is shared with multiple processes (e.g. a DLL) - one process may have the page resident and can access it directly, while another process might need to invoke a page fault to access this page (which should be extremely quick since the page is already resident).

The Page Frame Number Database (PFN).

In order to answer the question "what is this page doing?", Windows maintains the Page Frame Number database (PFN Db). It is simply an array of _MMPFN structs which starts at the symbol "nt!MmPfnDatabase" and has a single entry for every physical page on the system. The _MMPFN struct must be as small as possible, so it consists of many unions and can be in several states - depending on the state, different fields must be interpreted differently.

Free, Zero and Bad lists

We start off by discussing these states - they are the simplest to understand. Pages in any of these states (indicated by a flag in _MMPFN.u3.e1.PageLocation) are kept on their own lists of Free pages (ready to be used), Zero pages (already cleared) or Bad pages (will never be used).

Active Pages: PteAddress points at a hardware PTE.

If the PFN Type is set to Active, then the physical page is used by something. The most important thing to realize is that a valid physical page (frame) must be managed by a PTE.  Since that PTE record must also be accessible to the kernel, it must be mapped in the kernel's virtual address space.

When the PFN is Active, it contains 3 important pieces of information:
  1. The virtual address of the PTE that is managing this physical page (in _MMPFN.PteAddress).
  2. The Page Frame (Physical page number) of the PTE that is managing this physical page (in _MMPFN.u4.PteFrame). Note these two values provide the virtual and physical address of the PTE.
  3. The OriginalPte value (usually the prototype PTE which controls this page). When Windows installs a hardware PTE from a prototype PTE, it will copy the original prototype PTE into this field.

Here is an example of Rekall's pfn plugin output for such a page:

The interesting thing is that in this case, the PTE that is managing this page will belong to the hardware page tables created for the process which is using this page. That PTE, in turn, is itself controlled by a PDE inside that process's page tables, and so forth. This continues all the way up to the root of the page table (the DTB or CR3), which is its own PTE.

Therefore if we keep following the PTE which controls each PTE 4 times we will discover the physical addresses of the DTB, PML4, PDPTE, PDE and PTE belonging to the given physical address.  Since a DTB is unique to a process we immediately know which process owns this page.

Additionally, we can reconstruct the virtual address: at each level of the paging structure, the offset of the relevant entry from the start of its table gives the index bits for that part of the virtual address. This is illustrated below.

So when the PFN is in this state (i.e. PteAddress pointing to a hardware PTE) we can determine both the virtual address of this page and the process which maps it. It is also possible that another process is mapping the same page too. In this case the OriginalPte will actually contain the _MMPTE_SUBSECTION struct which was originally filled into the prototype PTE. We can look at this value and determine the controlling subsection in a similar way to the method described below.

Rekall's ptov plugin (short for physical-to-virtual) employs this algorithm to derive the virtual address and the owning process. Here is an example:
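A sketch of the invocation (the physical offset is an illustrative value, not taken from a real image):

[1] win7.elf 10:15:32> ptov 0x496d000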

We can verify this works by switching to the svchost.exe process context and converting the virtual address to physical. We should end up in the same physical address we started with:
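A sketch of that round trip (the cc plugin is assumed here for switching process context, and the parameter and address values are illustrative):

[1] win7.elf 10:15:32> cc proc_regex="svchost"
[1] win7.elf 10:15:32> vtop 0x777c0000   # the virtual address reported by ptov above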

Active Pages - PteAddress points at a prototype PTE.

Consider the case where two or more processes are sharing the same memory (e.g. mapping the same file). In order to aid in the management of this, Windows will create a subsection object as described earlier; if the virtual page is trimmed from the working set of one process, the hardware PTE will not be valid and instead point at the controlling subsection's prototype PTE.

In this case the PFN database entry will point directly at the prototype PTE belonging to the controlling subsection (the PFN entry will indicate that this is a prototype PTE via the _MMPFN.u3.e1.PrototypePte flag). Let's look at an example:

In this example, the PFN record indicates that a prototype PTE is controlling this physical page. The prototype PTE itself indicates that the page is valid and mapped into the correct physical page. Note that the controlling PTE for this page is allocated from system pool (0xf8a000342d50) while in the previous example, the controlling prototype was from the system PTE range (0xf68000000b88) and belonged to the process's hardware page tables.

If we tried to follow the same algorithm as we did before we will actually end up in the kernel's DTB because the prototype PTE is itself allocated from paged pool (so its controlling PTE will belong to the kernel's page tables). So in this case we need to identify the relevant subsection which contains the prototype PTE (and the processes that maps it).

When a process maps a file, it receives a new VAD region. The _MMVAD struct stores the following important information:
  1. The start and end virtual addresses of the VAD region in the process address space.
  2. The Subsection object which is mapped by this VAD region.
  3. The first prototype in the subsection PTE array which is mapped (Note that VADs do not have to map the entire subsection, the first mapped PTE can be in the middle of the subsection PTE array. Also the subsection itself does not have to map the entire file either - it may start at the _SUBSECTION.StartingSector sector).

The _MMPFN.PteAddress will point at one of the prototype PTEs. We build a lookup table between every VAD region in every process and its range of prototype PTEs. We are then able to quickly determine which VAD regions, in which processes, contain the pointed-to PTE address, and so we know which processes are mapping this file.

The result is that we are able to list all the processes mapping a particular physical page, as well as the virtual addresses each use for it (using the _MMVAD information). We also can tell which file is mapped at this location (from the _SUBSECTION data and the filename) and the sector offset within the file it is mapped to. Here is the Rekall output for the ptov plugin:

Rekall is indicating that this page contains data from oleaut32.dll at file offset 0x8a600. As you can see in the output, this data is shared with a large number of processes.

Putting it all together

We can utilize these algorithms to provide more context for scanning hits in the physical address space. Here is an example where I search for my name in the memory image using the yarascan plugin:

The first hit shows that this page belongs to the rekall.exe process, mapped at 0x5522000. The second hit occurs at offset 0x177800 inside the file called winpmem_2.0.1.exe, and so on.

This information provides invaluable context to the analyst and helps reasoning about why these hits occur where they do.

The rammap plugin aims to display every page and what it is being used for (click below to zoom in). We can see that some pages are owned by processes, some are shared via mapped files and others belong to the kernel pools:

Other applications: Hook detection

Inline hooking is a very popular way to hijack the execution path in a process or library. Malware typically injects foreign code into a process then overwrites the first few bytes of some critical functions with a jump instruction to detour into its code. When the API function is called, the malware hijacks execution and then typically relays the call back to the original DLL to service the API call. For example it is a common way to hide processes, files or network connections.

Here is an example output from Rekall's hooks_inline plugin which searches all APIs for inline hooks.

In this sample of Zeus (taken from the Malware Analyst's Cookbook), we can clearly see a jump instruction inserted at the start of critical functions (e.g. NtCreateThread), allowing Zeus to monitor calls to these APIs. Rekall detects the hooks by searching the first few instructions for constructs which divert the flow of execution (e.g. jump, push-ret etc).

Let us consider what happens in the PFN database when Zeus installs these hooks. Before the hook installation, the page containing the functions is mapped to the DLL file on disk. When Zeus installs the trampoline by writing to the virtual memory, Windows changes the written virtual page from a file-backed mapping to a private mapping. This is often called copy-on-write semantics - Windows makes a copy of the mapped file page private to the process whenever the process tries to write to it. Even if the page is shared between multiple processes, only the process which wrote to it will see these changes.

Let's examine the PFN record of the hooked function. First we switch to the process context, then find the physical page which backs the function "ntdll!NtCreateThread" (Note we can use the function's name interchangeably with the address for functions Rekall already knows about).

Now let's display the PFN record (Note that a PFN is just a physical address with the last 3 hex digits removed):
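As a sketch of the command (the PFN value is illustrative; it is simply the physical address from the previous step shifted right by 12 bits, i.e. with the last 3 hex digits dropped):

[1] win7.elf 10:15:32> pfn 0x12345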

Notice that the controlling PTE is a hardware PTE (which means it exists in the process's page tables). There is only a single reference to this page which means it is private (Share Count is only 1).

Let's now examine the very next page in ntdll.dll (The next virtual page is not necessarily the next physical page so we need to repeat the vtop translation - again we use Rekall's symbolic notation to refer to addresses):

And we examine the PFN record for this next page:

It is clearly a prototype page, which maps a subsection object. It is shared with 22 other processes (ShareCount = 22). Let's see how this physical page is mapped in the virtual address:

So the takeaway from this exercise is that by installing a hook, Zeus has converted the page from a shared to a private mapping. If we assume that Zeus does not change files on disk, then memory-only hooks can only exist in process private pages and not in shared file pages. It is therefore safe to skip shared pages in our hook analysis. This optimization buys a speedup of around 6-10 times for inline hook inspection - all thanks to the Windows PFN database!

Searching memory with Rekall


This blog post covers the new searching capability in the latest Rekall release (Starting from Rekall 1.5.2). The searching capabilities in Rekall are powered by the Efilter project.

Customizing plugin output

Rekall is a plugin based framework. This means that Rekall comes with many plugins written by different contributors. For example, one of the most popular plugins is the pslist plugin. We can see some help about the pslist plugin by following it with a question mark (?):
We can see that by default, the plugin can filter the output by pid, or process name. These filters are common to all plugins which deal with processes. For example, suppose we want to only see the svchost.exe processes:
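In the Rekall console that looks roughly like this (the parameter name used here is an assumption about the process selectors mentioned above):

[1] win7.elf 10:15:32> pslist proc_regex="svchost"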
The pslist plugin has a typical tabular output. There are a number of columns which were pre-chosen by the plugin author, such as the address of the _EPROCESS struct, the process name, the pid etc.


This information is very useful; however, we don't have a way to customize the output much beyond the columns hard-wired into the plugin. For example, we might want to sort the output by start time, pid or something else.


How can we do this? The solution is provided by the Efilter library and the search plugin implemented in Rekall.

Efilter and the search plugin.

Efilter is a filtering framework which implements an SQL-like search language. This approach is not new. For example other tools (such as Volatility) can produce output into sqlite tables, which can subsequently be filtered using SQL.


The main difference with the Efilter approach is that Efilter does not actually use pre-extracted data, but rather runs Rekall plugins on demand automatically in order to satisfy each query. As we see below, this allows queries to inspect data which was never even exported directly by the plugin - giving a complete and flexible interface for inspection of analysis results.


The general search process is illustrated above. Efilter analyzes the query to figure out which plugins will be run, then the output from these plugins is fed into the Efilter framework where the specified filters are applied. The query can specify a set of columns to display (and possibly their sorting order). The result is a customized tabular output governed by the specified query.


Here is a trivial example:

select * from pslist()






In this example we simply select all the output from the pslist() plugin (using the SQL * specifier). Efilter simply runs the pslist plugin and outputs all the rows produced by the plugin with no filtering. The output is also identical to the pslist plugin.


In the above query, Efilter just re-emitted all the columns emitted by the plugin but we can pick and choose some of the columns. Before we can filter and select some columns from the pslist plugin, we need to know exactly what type of output the plugin is producing. The column names are human readable and may not always correspond to the specific name of the column itself.


We use the describe plugin to describe the output of the pslist plugin in much the same way that the SQL describe statement describes a table:
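The invocation is just the describe plugin followed by the target plugin name (the prompt is illustrative):

[1] win7.elf 10:15:32> describe pslist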
Next to each field, we see the type of the field. This gives us an idea of what type of filtering operation is possible with this field. For example, let's display only the wow64 processes and show which session they are running in. The wow64 field is a boolean field so we simply evaluate it in the where clause:

select _EPROCESS, session_id, wow64 from pslist() where wow64
If you look closely at the output you might notice that the _EPROCESS column is actually split into 3 different columns, the virtual address, the process name and the process pid. Similarly in the output of "describe pslist" above there is no specific column for the process name or pid - all we see is a single _EPROCESS column with a type of _EPROCESS.


What is going on? How can we select only the process name?


When a plugin emits a more complex type (in this case the plugin emits raw _EPROCESS objects), Rekall might employ a specialized renderer (or a customized output format) for this column. In this case a single _EPROCESS object is shown as a small table with three columns. However, Efilter actually sees the raw object itself. Therefore, as far as the search plugin is concerned, the pslist plugin emits a raw _EPROCESS object for each row.


We can use this fact by dereferencing fields inside the _EPROCESS object itself within a search query. Let us repeat the "describe pslist" plugin, but this time we tell it to show sub fields to a depth of 1 (In this screenshot we snipped many of the fields because the output is long. The plugin actually shows all the members of the _EPROCESS struct, as well as Rekall defined pseudo-members and properties.):


We can see that the _EPROCESS object itself contains many fields, and they can all be used as filtering targets, columns and sort orders. Here is something a bit more complex:


select _EPROCESS.name, _EPROCESS.pid, wow64 from pslist() where regex_search("svchost", _EPROCESS.name) order by _EPROCESS.pid desc









Here we sort by pid in reverse order and show all the processes which match "svchost" as well as their Wow64 status. Note that the built-in regex_search() function matches the regular expression case-insensitively, anywhere in the name.


Being able to drill into the objects returned by plugins allows users to invent completely different tables, even extending the output the original plugin was not designed to produce.


For example, closer inspection of the _EPROCESS object (either with the describe plugin or the dt plugin)  reveals that extra information is present in the _SE_AUDIT_PROCESS_CREATION_INFO struct produced by the windows auditing system. We also see a FullPath member on the _EPROCESS object (This is actually a virtual member added by Rekall which displays the full path to the binary running the process). Let's find all the processes which were started from locations other than the Windows directory and also show the audit system's record of where they were started:


SELECT _EPROCESS.name, _EPROCESS.pid, _EPROCESS.FullPath, _EPROCESS.SeAuditProcessCreationInfo.ImageFileName.Name AS audit_name FROM pslist() WHERE NOT regex_search("Windows", _EPROCESS.FullPath)








Note the use of the regex_search() function to apply a regular expression, the NOT operator to exclude matches, and the AS operator to rename a column to a more meaningful name.


NOTE: Efilter currently provides the =~ operator for regular expression matching, however this matching is case sensitive. When matching Windows file names we never want a case sensitive match, or we might miss some filenames which should match. It is therefore always preferable to use the case insensitive regex_search() function instead.

Subqueries

Sometimes it is useful to take the result of one search and apply it as an input to another plugin. Efilter supports this concept as a subquery. For example, suppose we asked - which processes were launched by a particular user? There is already a plugin which tells us this - the tokens() plugin:


select * from tokens() where regex_search("User: a", Comment)


We could also select by Sid but Rekall already resolves the Sid to a username for us. Now we would like to know which of these processes holds an open handle to the pmem driver?


select _EPROCESS, handle, access, details from handles(pids: (select Process.pid from tokens() where regex_search("User: a", Comment)).pid) where regex_search("pmem", details)





The above example is a bit hard to follow because it is all on one line and has a subselect clause. Efilter also allows us to save entire queries and therefore make the query more readable. To make this easier we can use the %%search magic command in the Rekall shell. This allows us to write more complex, multi-line queries:


[1] win7.elf 20:37:02> %%search
let user_a_processes=select Process.pid from tokens() where regex_search('User: a', Comment)
select _EPROCESS, handle, access, details from handles(pids: user_a_processes.pid) where regex_search('pmem', details)







How does this work?
  1. The let assignment stores a query at the variable user_a_processes. Note that it does not execute the query at this point yet. A stored query is simply a table with columns and rows:
    1. In this case there is only one column called "pid" and several rows.
  2. The next query executes the handles plugin and provides the stored query to the "pids" parameter. Since the query is just a table, we need to choose which column to expand into the pids arguments (this is the purpose of the second ".pid"). Now the plugin will receive a list of pids to operate on.
  3. The output from the handles plugin (restricted by the pids selected by the first query) is further filtered for a file handle matching "pmem".


Note that we could have just run the handles() plugin without arguments and filtered its output, but this would have been inefficient because Efilter would need to list all the handles for all the processes and then filter out those processes we don't care about. It is always important to reduce the total number of processes examined up front by providing good process selectors to plugins which take them. Efilter does not currently have a good sense of the cost of running each plugin, so it is up to the user to decide in which order the queries should be run and how the output is to be combined.


In this blog post we have demonstrated how EFilter queries are useful to tailor exactly the output you need from Rekall plugins. In the next blog post we will discuss how to harness this power in formulating and recovering forensic artifacts for memory images.


The Rekall Agent Whitepaper

This post introduces the Rekall Agent - a new experimental IR/forensic endpoint agent that appears in Rekall version 1.6. The Rekall Agent will be officially released with the next major Rekall release, but for now you can play with it by installing from git head using the following commands:

$ virtualenv  /tmp/MyEnv
New python executable in /tmp/MyEnv/bin/python
Installing setuptools, pip...done.
$ source /tmp/MyEnv/bin/activate
$ pip install -e ./rekall-core/ ./rekall-agent/ .


The Rekall Agent

Security agents running on managed systems are common and useful tools these days. There are quite a few offerings out there from commercial offerings like Tanium or Carbon Black to open source offerings like GRR and OSQuery.

We have been developing and using GRR for some time now and have gained quite a bit of experience in designing and operating the system. GRR is an excellent system and works very well, but over time it has become clear that there are aspects of the system which are lacking, or where GRR does not perform as well as it should.

I recently re-examined much of the feedback we received about GRR and tried to think about some of the issues involved with deploying and operating GRR at very large scale. In particular I am focusing on open source deployments, and the accessibility of GRR to new users and operators. As part of this exercise I have tried to reimplement or rethink some of the GRR design decisions in order to make a system which is better able to do what users want GRR to do. This is not because what GRR does is necessarily bad or its design flawed, but rather because we can learn from the experience gained developing GRR in order to build a better, more scalable and easier to use system.

One of the most common complaints I heard about GRR is that it is very complex to run and deploy. There is the choice of which data store to deploy (with different reliability/scalability characteristics), front end load balancing, and provisioning the right amount of front end and worker capacity for the size of the deployment and the expected workload. When things get very busy in a large hunt, the front ends tend to get overloaded and clients receive HTTP 500 errors (which makes them back off, but also makes them temporarily inaccessible). Typically, if a large hunt is running, it is not possible to do targeted incident response.

I wanted to design a system focused on collection only. Most users just want to export their data from the system and so the system should just do that and nothing else. It should be scalable and easy to deploy.

These are the goals I wanted to achieve:

  1. No part of the system should be in the critical path. The system should never deliver an HTTP 500 error to a client under any reasonable level of load. The client must never wait on any part of the system before completing its task.
  2. The system should be easier to deploy at any scale. From small to large deployments it should be easy for the system to be deployed with minimal training.
  3. The system should be able to do what users want from it. Users want to be able to do a full directory listing, they want to upload very large memory images. Users want to be able to search for a file glob in seconds not hours. Users want to schedule all clients in hunts in seconds not in hours.
  4. The system should be simple. A user should be able to find what they want from the system without understanding internal system architecture. They should be able to easily export or post process all data. The system should be discoverable and obvious without having to resort to reading the code.
  5. The system should scale well under load. If load suddenly increases the system should be able to handle it linearly (i.e. load doubles => processing times double). NOTE: goal 1 must be achieved even at high load.
  6. The system should avoid copying data needlessly as much as possible. Either data has to be written into its final resting place in the first place, or data should be virtualized from multiple locations when read by the user.

This document illustrates the experimental implementation dubbed the Rekall Agent. It should not be considered a replacement to GRR (there are some features that GRR offers, that have not been implemented yet) but it is a proof of concept in trying to implement lessons learned from GRR and improve upon the GRR tool.

In the following document I shall refer back to these goals and try to illustrate how the new design as implemented with the Rekall Agent solves these issues.

The Rekall Agent Design

In the following description of the Rekall Agent, I will compare and contrast many details with the GRR design. This will hopefully clarify the new design and illustrate why it improves on GRR.

The Rekall Agent is a collection system

I observed that most users do not perform much analysis in GRR, preferring to export their data to other systems. For example, users want to export data to timelining systems (such as Plaso or Timesketch) or to large scale data mining environments like Elasticsearch.

Therefore, to simplify the design of the Rekall Agent we state our primary goal is that of collection and not analysis. Rekall itself performs client side analysis (for example extracting signals from memory) but the collected data is simply made available for exports with minimal post processing by the Rekall Agent. As we see below a major design pattern is that data is uploaded directly to its final resting place by the agent leaving the batch processors to perform very limited work, mainly collecting high level statistics.

Everything is a file!

In GRR everything is an AFF4 object. An AFF4 object is simply an object (it has behaviours and data) which is located in a particular URN or a path. For example aff4:/C.12345/vfs/os/Windows/System32/Notepad.exe is an AFF4 object containing information about the notepad binary on a specific client.

The Rekall Agent aims to simplify every aspect of the system, and what better way to simplify things than the old Unix mantra of "everything is a file". Users understand files - they know what to do with them - and because files are so convenient, there are already many systems designed and tested to process lots of files, big and small.

In this regard Rekall is very similar to GRR - Files in Rekall are essentially the same as AFF4 objects in GRR. The client only knows and cares about downloading and uploading files. The server only knows about processing files within the filestore (e.g. the cloud bucket).

Let's look at a diagram of the Rekall Agent System:

The main components are the Rekall Agent, which is installed on the thousands of systems being monitored, and the Agent Console, used by the administrator of the system to control it. These two parts are connected via a shared file storage system such as a cloud bucket (for those users who wish to run their own infrastructure, there is a standalone HTTP file server they can run).

The Rekall agent operates in a loop:
  1. Read a jobs file on a web server (if it has changed since last time).
  2. Parse the jobs file for flows the client did not run previously.
  3. Run any new jobs. While the jobs are run, the client can upload files to the server. For example:
    1. A ticket is a small JSON file containing information about the currently running job. Tickets are read by various batch processors. A ticket can be thought of as a message sent by the client to the batch processor informing it of what it is doing.
    2. A collection is a single file that contains many results. It is basically a SQLite file with results.

Google Cloud Storage

Because managing files is such a common thing, there are commercial services already designed to store and serve files. An example of such a service is Google Cloud Storage which is essentially a service designed to serve and receive files at scale and with minimal cost.

By simplifying the design such that everything simply moves files around, we can easily use such a commercial service. For the open source user this simplifies deployment immensely because they do not need to worry about running servers, capacity planning, outages and other mundane issues. Of course, for those users who actually want to run their own servers this is always possible (see below), but for those who do not, GCS is a perfect fit.

Deploying Rekall Agent in the Cloud.

Deploying Rekall Agent in the cloud takes three easy steps. First, create a new bucket to store your data:

Next create a service account for the Rekall console to manage the deployment. The service account should have the Storage Admin role at a minimum. You will need a new private key which should be fetched in JSON format. Store the JSON file somewhere on your disk.

NOTE: The service account key controls access to the bucket. It is required in order to delegate upload rights to clients (which have no credentials) and to be able to view results in the Rekall Agent console.

Finally you will need to create a new config file for both clients and servers. The config files contain keys for both clients and server, as well as some basic configuration. The agent_server_initialize_gcs plugin creates everything needed to run on GCS.


Now it is possible to run the agent client. Note that we do not need to deploy VMs, frontend servers, workers or anything really. The system is now completely set up and ready to go. We just need to install the Rekall Agents at the endpoints we want to monitor and point them at the correct config file.
Cloud storage cost model
Deploying a Rekall installation into a cloud bucket also gives predictable cost estimates. There are three types of chargeable operations made by the system:
  1. Polling the jobs files is a class B operation, currently charged at $0.01 per 10,000 queries.
  2. A bucket listing operation is a class A operation, charged at $0.20 per 10,000 queries. The agent worker issues one such request for each type of task it runs. In practice this cost is insignificant since the worker might only run batch jobs every few seconds.
  3. A charge per GB stored.

The dominant cost of a cloud deployment is driven by the client poll interval. Clients poll their own jobs file and at least one other (the All hunt file and any label files). So costs rise linearly with the total number of clients, their polling frequency and the total number of labels.

More frequent polling means more responsive clients (i.e. a job issued to a client might be picked up sooner). An example cost calculation assumes clients poll every 600 seconds (10 minutes) on 2 queues: that is 12 queries per hour, or 8640 queries per month per client. For a 10,000 client fleet this will cost at least $86 per month (i.e. even when the system is completely idle).

Obviously when the system is used to actively collect data costs will rise from there depending on usage. For certain deployment situations such a well constrained cost model is a clear benefit.

Deploying Rekall Agent on self hosted servers.

Sometimes users prefer to deploy their own servers. There are some clear pros and cons:
  • Pros:
    • All data remains on users servers and not stored in the cloud.
    • For very large deployments there may be some cost advantage.
  • Cons:
    • Users must maintain their own infrastructure, including availability, scaling, load balancing and reliable high bandwidth connectivity.
    • Users must manage storage requirements. Ultimately this system writes files locally to disk so you need a large enough disk to hold all the data you will be gathering.

Nevertheless, the Rekall Agent comes with its own HTTP server. To use it, simply initialize the configuration files using the agent_server_initialize_http plugin. You will need to provide an externally accessible URL for clients to connect to.

Enrollment

The Rekall agent is a zero configuration agent. This means that there is no specific configuration required of the client before deployment. The client enrols automatically, generating its own keys, and unique client id. This process is called Enrollment.

The Rekall Agent simply performs an Interrogate action by itself, without needing any server support. This means the Rekall Agent is enrolled immediately and does not need to wait for the server. These are the Rekall Agent's startup steps:
  1. When the Agent starts, it reads the Manifest file. This file is signed by the server's private key and verified by a hard coded CA certificate embedded into the agent's configuration file. This first step ensures that the bucket is owned by the correct controller (since it is assumed that only the possessor of the relevant private key can task the client).
  2. The Manifest file specifies some startup actions to be run on agent startup. By default the interrogate action runs on client startup.
  3. The Agent collects information about itself (like OS version, hostname, IP address etc) and writes a Startup ticket. The ticket is eventually post processed by the Startup batch job to collect high level information and statistics about the client.
  4. The client does not need to wait for the server. The client can simply go ahead and poll its own jobs queue (and any label based hunt queues) immediately and if there are any outstanding hunts the client will immediately participate in them.

This process can be seen in the following screenshot:


Note how the client polls several jobs files at once. The jobs files are termed Queues:
  1. The client's private jobs queue is located under its own client id in the namespace. This contains flow requests specifically directed towards a particular client (1-1 messaging).
  2. The client also polls a label related job (clients can carry several labels). All clients have the All label. Flows sent through these queues are directed at entire classes of agents (e.g. The All label is seen by all clients). This is how hunting is implemented. This is essentially a broadcast messaging system.
  3. Note that the client can request to receive the jobs file only if it has been modified since the last time it requested it (using the HTTP If-Modified-Since header); a sketch of such a conditional request follows this list. This means the first request downloads the file, but any subsequent requests do not transfer any data and therefore have essentially zero cost on client or server. This allows clients to poll very frequently without any additional system load.
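Conceptually each poll is just a conditional HTTP GET against the bucket; a hedged sketch with curl (the object path for the client's jobs queue is invented for illustration, while the bucket and client id are reused from the flow example below):

$ curl -s -o jobs.json \
    -H "If-Modified-Since: Mon, 10 Oct 2016 08:44:27 GMT" \
    https://storage.googleapis.com/rekall-temp/C.4dd70be22bc56fc3/jobs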

Flows

In GRR, a flow is a state machine which sends requests to the client, waits for responses from the client, then sends more requests, and so on. Processing of client responses can occur at any time throughout the flow.

Rekall simplifies this approach by splitting a flow into only two phases. The first phase of a flow runs on the client, while the second phase runs on the server (We call this phase post-processing). The client's part of the flow runs multiple client actions, creating and uploading one or more result collections (see below).

Once the first phase of the flow is complete, the client writes a ticket to a specified location. The FlowStat job processor then launches the second part of the flow in the batch processor. This part of the flow primarily performs post processing on the client's result collection.

Rekall flows can be in one of the following states:
  1. Pending: The flow is waiting to be picked up by the client. It is written into the jobs file.
  2. Started: The flow is currently worked on by the client. A Started ticket has been written by the client and the client has proceeded to execute the flow.
  3. Done/Error (not post processed): The flow is complete by the client and the final ticket is written. In the case of an error the ticket includes a backtrace or error message.
  4. Done (post processed): Post processing of results is run (if required) by the Rekall Agent worker.

Note that it is impossible for the worker to delay the client from completing the flow. The client simply executes the flow and returns all the results when it is done. Even if the worker is completely stopped this does not affect the client in any way because the worker post processing is not in the critical path (See below for a demonstration). The worker is essentially a batch job which may run post processing on the results with no time constraints.

Running a flow

Let's take a look in detail at an example jobs file. A job file contains a JSON encoded list of flows. Here is an example of one such flow (Encoded in YAML for easy reading):

- __type__: ListDirectory
  actions:
  - __type__: ListDirectoryAction
    path: /usr/share/man
    recursive: true
    vfs_location:
      __type__: GCSSignedURLLocation
      bucket: rekall-temp
      expiration: 1475916267
      method: PUT
      path: C.4dd70be22bc56fc3/vfs/collections/usr/share/man/F_c18ec2cfb1
      signature: |
        a1POInGl2lBB4CRwcKyEC2QWeFMZs92XCw4Ibiih+hQs6bqykulKU8Kh+q/67UDdRgKy
        XXXXXZrJBg==
  client_id: C.4dd70be22bc56fc3
  created_time: 1475829867.746249
  flow_id: F_c18ec2cfb1
  path: /usr/share/man
  recursive: true
  session:
    __type__: RekallSession
    live: API
  ticket:
    __type__: FlowStatus
    client_id: C.4dd70be22bc56fc3
    flow_id: F_c18ec2cfb1
    location:
      __type__: GCSSignedPolicyLocation
      bucket: rekall-temp
      expiration: 1475916267
      path_prefix: tickets/FlowStatus/F_c18ec2cfb1
      path_template: '{client_id}'
      policy: |
        ey5kaXRpb25zIjogW1sic3RhcnRzLXdpdGgiLCAiJGtleSIsICJyZWthbGwtdGVtcC90aWNr
        VDA4OjQ0OjI3Ljc0ODc1MiswMDowMCJ9
      signature: |
        I1f0oyEssNRyn10kuvK0XMwom1Ee0IVRMzlulxK/7I4PrhDw5T7TGZvqC4AUEPqQQwSumn+
        XXXXXybRerfc/KcEA==
    status: Started

In the above example we see that a ListDirectory flow was issued. The flow contains a list of actions, the first of which is a ListDirectoryAction. The client will run this action (which recursively lists the directory specified in the path parameter). When done, the client will upload the result collection to the vfs_location specified.

The vfs_location parameter is of type GCSSignedURLLocation, which specifies a method to upload files to the Google Storage Bucket named in the location. It also includes the exact expected path within the bucket and a signature block which is enforced by the GCS servers. Note that this information grants the Rekall Agent client permission to upload the result collection to exactly the specified URL for a limited time. The agent has no other credentials on the bucket and can not read or write any other objects. Only the flow's creator (using their service account keys) can grant access to the named object.
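
For illustration, uploading a finished collection to such a location boils down to a plain HTTP PUT of the file to the signed URL. The sketch below assumes the legacy GCS signed-URL query-string layout and uses a made-up access id; the real agent assembles the URL from the location object's fields:

import requests
from urllib.parse import quote

def upload_to_signed_url(local_path, bucket, object_path, signature, expiration):
    """Upload a result collection using only the credentials embedded in the location."""
    # Assumed V2 signed URL layout: object URL plus Expires/Signature/GoogleAccessId.
    url = ("https://storage.googleapis.com/%s/%s"
           "?Expires=%d&Signature=%s&GoogleAccessId=AGENT_SERVICE_ACCOUNT"
           % (bucket, object_path, expiration, quote(signature, safe="")))
    with open(local_path, "rb") as fd:
        response = requests.put(url, data=fd)
    # GCS enforces the signature: any other path, method or expiry is rejected.
    response.raise_for_status()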

Similarly the agent is given a FlowStatus ticket to return to the server when it begins processing the flow. The ticket contains a GCSSignedPolicyLocation location allowing the client to write the ticket anywhere under the prefix tickets/FlowStatus/F_c18ec2cfb1 .

So to summarize this section, we have seen that:
  1. Locations can specify different ways of uploading the file to the server. They also contain all the credentials required by the client to upload to a predetermined location. Since everything is a file in the Rekall Agent, everything deals with various Locations - so this is an important concept.
  2. Clients have no credentials by themselves, they simply do as they are told and use the provided credentials to upload their results.
  3. A client action is simply a dedicated routine on the client which runs a certain task, creates a result collection and uploads the result collection to the cloud.

Result collections

The concept of result collections is central to the Rekall Agent. Since in Rekall everything is a file, I was looking for a file based structured storage format, and the most widely known and recognized structured file format out there is SQLite. Using SQLite to store results is also reminiscent of the GRR SQLite data store, and we know that as long as any single SQLite file is not highly contended and not too large, SQLite is a very good format. In fact, with the Rekall Agent architecture, result collections are typically written once by the client and then read multiple times by flow processors or the UI.

For example, consider the task of storing a client's complete directory listing (e.g. in order to generate a timeline). On a typical client the directory listing is a few million files. Using a single deployment-wide database would increase the total row count by several million rows per client, as discussed previously. If we keep the data in one large global table, that table will eventually grow and become slower as the system is used more and more.

Rekall, however, creates a single SQLite file as the result collection for one flow. This means that all the results from that flow are stored in the one file. When the user wants to look at another flow, another database file will be used. This prevents the system from becoming slower over time: each SQLite database is isolated from all others, and it is only opened when the user explicitly wants to look at that specific result collection.

For example we can see the results of the ListDirectory flow above (using the view plugin):

The result is simply an SQLite table populated and returned from the client, containing stat() information about every file in the requested directory.

Note that the Rekall agent prepares the result collection by itself, and then uploads it at once to the server. There is no need for a worker to do anything with it other than just noting that this collection exists (i.e. maintain metadata). Once the file is uploaded, the worker may or may not post-process it but the client is not kept waiting. Even if a worker is not running, an end user of the system can just download the result collection manually.
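
Because a collection is just a SQLite file, nothing stops an end user from inspecting a downloaded collection with the Python standard library. The file name below is a made-up placeholder, and the table name "default" is taken from the collection specifications shown later in this post:

import sqlite3

# A result collection downloaded from the bucket (the path is hypothetical).
conn = sqlite3.connect("F_c18ec2cfb1_collection.sqlite")

# List the tables first, then dump a few rows from the default table.
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
for row in conn.execute('SELECT * FROM "default" LIMIT 10'):
    print(row)

conn.close()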

Customized result collections

It is usually the case that deployed clients are hard to upgrade. In any real deployment there will be a certain number of clients running older versions of the software, and test/deployment cycles are difficult to speed up.

For example, suppose that after the clients have been deployed, an analyst wants to use an additional condition on the FileFinder() flow, or wants to retrieve an extra field that we had not thought of. Do we need to deploy a new version of the Rekall Agent to do this?

The answer is usually no. The Rekall Agent is very flexible and typically does whatever the server asks of it. We get this flexibility thanks to the Efilter library, which implements a complete SQL query system within the Rekall Agent.

Flows are usually written so that the full result collection specification is provided to the client itself, together with an Efilter query to run in each supported mode. For example, consider the ListProcesses flow. There is no dedicated ListProcesses client action; instead the flow simply specifies a generic SQLite collection and instructs the client to store inside it the results of an Efilter query:

__type__: ListProcessesFlow
actions:
- __type__: CollectAction
  collection:
    __type__: GenericSQLiteCollection
    location:
      __type__: GCSSignedURLLocation
      bucket: rekall-temp
      expiration: 1475972560
      method: PUT
      path: C.4dd70be22bc56fc3/vfs/analysis/pslist_1475886160
      signature: |
        mngmDMUb337BfhgzTfpRGA0P0IgmpU5PD69S02F27FQGy3+706c2tfR+kJBlFuBdzb9Tj
        XXXXXg==
    tables:
    - __type__: Table
      columns:
      - __type__: ColumnSpec
        name: name
        type: unicode
      - __type__: ColumnSpec
        name: pid
        type: int
      - __type__: ColumnSpec
        name: ppid
        type: int
      - __type__: ColumnSpec
        name: start_time
        type: epoch
      name: default
    type: pslist
  query:
    mode_linux_memory: select proc.name, proc.pid, ppid, start_time from pslist()
    mode_live_api: select Name as name, pid, ppid, start_time from pslist()
  query_parameters: []
...

As in the previous example, the generic CollectAction takes a location to upload the collection to, but this time the collection schema is fully described: in this example the columns name, pid, ppid and start_time will be returned. To produce them, in live memory mode on Linux the query "select proc.name, proc.pid, ppid, start_time from pslist()" will run, and in API mode the query "select Name as name, pid, ppid, start_time from pslist()" will run.

Suppose that in the next version of the Rekall Agent we wish to write a new ListProcesses flow which implements different filtering rules, or reports back more (or fewer) columns in the result collection. To do this, no new code needs to be deployed on the clients: the next version of the controller simply changes the query and result specification without touching the client itself. Even an old client will adapt its output based on the new specification.

Of course the client needs to have the basic capability of listing processes (e.g. via the Rekall pslist plugin), but having an Efilter query dictate the output format and control the execution of existing plugins provides us with unprecedented runtime flexibility, allowing us to make maximum use of the existing client capabilities.
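
A heavily simplified sketch of what such a generic collect action does with the specification above is shown below. The EFilter execution itself is stubbed out behind a callback, and the local file path is a placeholder - only the shape of the logic is intended to match the description:

import sqlite3

def collect(spec, query, run_efilter_query):
    """Create a collection with the server-specified schema and fill it from a query."""
    table = spec["tables"][0]
    columns = [column["name"] for column in table["columns"]]

    # The collection is built locally, then uploaded to spec["location"].
    conn = sqlite3.connect("/tmp/collection.sqlite")
    conn.execute('CREATE TABLE "%s" (%s)' % (table["name"], ", ".join(columns)))

    # run_efilter_query stands in for Rekall's EFilter engine; it is assumed to
    # yield dicts keyed by the output column names of the SELECT statement.
    placeholders = ", ".join("?" for _ in columns)
    for row in run_efilter_query(query):
        conn.execute('INSERT INTO "%s" VALUES (%s)' % (table["name"], placeholders),
                     [row.get(column) for column in columns])
    conn.commit()
    conn.close()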

To summarize this section, we have seen that:
  1. Rekall collections are SQLite files, the schema of which is specified from the server (so they can change with time if required).
  2. The client fills the SQLite files with the output of the provided Efilter query. The query filters and combines the output of other Rekall plugins in arbitrarily flexible ways.

Hunts

A hunt is an operation which directs multiple clients to run the same flow at the same time. The results from the hunt are merged together and reported over the entire fleet. For example, a hunt might be run in order to search for a particular file glob or registry key across all machines.

Hunts are typically issued on some subset of clients (e.g. all windows machines or all machines of a given label).
GRR implements hunts via a routine in the frontend (called the foreman) which retrieves client information from the datastore (e.g. the client's operating system) and issues a separate flow for each client which matches the criteria. In other words, in GRR, it is the frontend that decides if a given client should receive the hunt. Because this decision process is relatively expensive (making frequent database queries), the foreman check is only run for each client once every half an hour by default. This means that in practice hunts can not be run faster than half an hour, even if the hunt is instructed to schedule all clients immediately.

Furthermore because hunt processing is very expensive in GRR, the GRR foreman has to throttle hunt scheduling. The default client scheduling rate is 20 per minute (which is very low - for a 10k client deployment a hunt would take more than 8 hours). If the client tasking rate is too high the resulting client load can easily bring down the front end servers and overload the system. However in practice, it is difficult to accurately estimate how much load a particular hunt is going to generate, leaving the user to guess the appropriate client limit.

Rekall Agent does not use a foreman. Instead the hunt is just a regular flow with a condition specified. The condition is an efilter query which the client runs to determine if it should run the flow. For example the following efilter query restricts a hunt to Windows machines:

    any from agent_info() where key=='system' and value=='Windows'

In this context agent_info() is simply a Rekall plugin (which delivers information about the client in key/value form). All clients will see this hunt, but only those on which the efilter query triggers will actually execute the flow.

The user's imagination is the limit for what other conditions might be used. For example, a hunt could be run on all systems which have the process named "chrome" running right now:

    any from pslist() where regex_search(Name, "chrome")

Systems which do not match the condition simply ignore the hunt request.
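
On the client, hunt participation is therefore just a conditional wrapped around normal flow execution - something like the sketch below, where the condition field name and both callbacks are illustrative placeholders rather than the agent's real API:

def maybe_run_hunt(flow, evaluate_efilter, execute_flow):
    """Run a hunt flow only if this client matches the hunt's condition query."""
    condition = flow.get("condition")
    if condition and not evaluate_efilter(condition):
        # e.g. "any from agent_info() where key=='system' and value=='Windows'"
        # evaluated to false on this client: silently ignore the hunt request.
        return
    # Otherwise the hunt is handled exactly like a normal 1-1 flow.
    execute_flow(flow)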

Note that Rekall Agent does not run dedicated code on the server to start the hunt. A hunt is just a special kind of flow message posted on a shared message queue (jobs file); clients simply read the relevant message queue (jobs file) when it changes and decide for themselves if they should participate in the hunt. This means that in practice a Rekall hunt can complete in seconds because clients are not limited by the rate of scheduling the hunt or by the rate of hunt result processing. The hunt is essentially complete when the client uploads its result collection, barring any required post processing. For the first time, we are able to run a hunt which completes in seconds to capture the entire state of the fleet at the same time!

Let us now run the ListProcesses flow as a hunt:

__type__: ListProcessesFlow
actions:
- __type__: CollectAction
  collection:
    __type__: GenericSQLiteCollection
    location:
      __type__: GCSSignedPolicyLocation
      bucket: rekall-temp
      expiration: 1475978734
      path_prefix: hunts/F_224c8bed27/vfs/analysis/pslist_1475892334
      path_template: '{client_id}'
      policy: |
        eyJjb25kaXRpb25zIjogW1sic3RhcnpdGgiLCAiJGtleSIsICJyZWthbGwtdGVtcC9odW50
        cXXX
      signature: |
        WADjE4ckez8Y+C4uZDUFlq+XbZwN1U+l8GHIxYpHt7cXFuFyZ6dyu9v/JUCpl+ach1SPrdW
        XXXXXXXXPF7w==
...
ticket:
  __type__: HuntStatus
  client_id: C.4dd70be22bc56fc3
  flow_id: F_224c8bed27
  location:
    __type__: GCSSignedPolicyLocation
    bucket: rekall-temp
    expiration: 1475978734
    path_prefix: tickets/HuntStatus/F_224c8bed27
    path_template: '{client_id}'
    policy: |
      VDAyOjA1OjM0LjgyMjYxNiswMDowMCJ9
    signature: |
      0op7DJn10Lys/tX5zuKohBQmIIOUQQKi3nWXKg==

The hunt request is not very different from a regular flow request. The main differences are:
  • Hunt results are stored inside the hunt namespace and not in the client namespace.
    • This keeps related hunt results together in the bucket namespace (so hunts can be purged).
    • The location specifies a path_template which the client interpolates; this allows multiple clients to write to the same part of the namespace without overwriting each other's collections. The client is allowed to write anywhere under the path_prefix (a small interpolation sketch follows this list).
  • The ticket that the client writes is a HuntStatus ticket instead of a FlowStatus ticket. The HuntStatus ticket manages overall statistics for the hunt in a specific hunt collection. This is just a different batch processor which organizes information in slightly different ways.
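
The interpolation itself is trivial. Here is a rough sketch of how a client might derive its own upload path from the hunt's policy location; how the prefix and the interpolated template are joined is my assumption, while the field values are taken from the example above:

def hunt_upload_path(location, client_id):
    """Each client writes under the shared prefix, inside its own templated sub-path."""
    interpolated = location["path_template"].format(client_id=client_id)
    return "%s/%s" % (location["path_prefix"], interpolated)

print(hunt_upload_path(
    {"path_prefix": "hunts/F_224c8bed27/vfs/analysis/pslist_1475892334",
     "path_template": "{client_id}"},
    "C.4dd70be22bc56fc3"))
# -> hunts/F_224c8bed27/vfs/analysis/pslist_1475892334/C.4dd70be22bc56fc3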

The Rekall Agent controller does not bother merging all the results into a single output collection. Instead we maintain another very small metadata collection containing high level information about the overall hunt progress (e.g. how many machines participated, how many errors occurred, and where each machine uploaded its result collection).

If the user wishes to export the results of the hunt, the export() plugin simply opens all the clients' collections on demand and streams the results into the exported collection or into the relevant export plugin.

So to summarize we have seen that:
  • Hunts are simply specially prepared flows (or job requests) which are written to a queue shared between multiple clients.
  • Participation in the hunt is based on client self selection (implemented via an Efilter query).
  • All results from a hunt are kept in the same part of the namespace on the filestore.
  • Exporting the hunt results merges individual clients' result collections into one final result collection.

The User Interface

Rekall Agent does not have a fancy GUI at present. Instead we use standard Rekall plugins to control the system. There is no user management or RESTful API yet - all operations currently require full raw-level access to the bucket (using the service account's credentials).

The Rekall UI allows one to inspect the status of the hunt. For example, for our process listing hunt above, the below screenshot shows the total number of clients that responded, and a list of each client's result collection.


Since everything in the Rekall Agent bucket is just a file, it is sometimes easier to just list the bucket itself (the bucket can also be navigated using the Cloud Storage tools such as the Google Cloud Console and gsutil):

And the UI allows us to just view any file in the bucket directly. For example, in order to visualize what kind of metadata we keep about each of the clients in this hunt:

So we just keep all the hunt tickets for each client in a separate collection. The tickets contain the status message of running the flow, as well as the location where the result collection was written. The inspect_hunt plugin essentially uses data stored in this stats collection to tell us about the total clients that ran the hunt.

We can directly view the result collection from each client:

Evaluation

In order to understand how the new approach improves over GRR I examined a number of real world cases - typical of the way GRR is used.

Recursive FileFinder

It is very typical for users to recursively search the directory structure of a host, looking for files matching a certain name or timestamp. GRR has the FileFinder flow which makes issuing these requests easy, allowing users to issue Glob patterns.

In this example I am searching for the two glob expressions:
  • /home/scudette/**10/*.dll
  • /home/scudette/**10/*.exe

These search recursively from my home directory 10 levels deep for all files with .exe or .dll extensions.

GRR's traditional Glob algorithm is purely server based. Flows process the glob patterns into a suffix tree and then issue simple client primitives, like ListDirectory, Find etc.

Unfortunately GRR seems to have a bug at the moment with this feature (it seems to follow symlinks so it can never complete if there is a symlink pointing back up the directory tree) and so I could not complete this flow on my machine for comparison.

Rekall on the other hand performs much better:
As can be seen, the result collection is 150kb (about 2060 results) and takes about 11 seconds to compute the results and a couple of seconds to upload the file.
Just for comparison the Unix find command takes around 2 seconds on my system, so maybe there is some more room for optimization.

Running a hunt

Hunts are probably one of the most useful features of Rekall Agent. The ability to run a collection at the same time on all systems is important in order to scale response. However, as noted before, the way GRR runs hunts is inefficient, and so GRR finds it difficult to scale.

To subjectively compare performance I designed the following experiment. GRR was installed on my system in the normal way with a single worker, adminUI and frontend. I also launched the GRR pool client (a tool used for benchmarking GRR by starting several separate GRR clients in the same process using different threads). The GRR pool client can be made to reuse a certificate file, which means we can skip the enrollment phase and just run multiple pre-enrolled clients (since Rekall Agent does not really have an enrollment workflow, we can not compare the tools on this function). This essentially simulates a large deployment.

I started the GRR pool client with 100 separate client threads. I changed the poll interval to every 5 seconds in order to get the clients to be as responsive as possible (default is poll every 10 minutes). I then also changed the clients' foreman poll to 5 seconds (default is 30 minutes) to allow clients to pick up the hunt as quickly as possible.

For a hunt I chose a very simple and cheap action: List the /bin/ directory (FileFinder flow with Stat action of the glob /bin/*). On my system the /bin/ directory contains 164 files and we are only listing them (the equivalent of ls -l /bin/*).
The total time for all 100 clients to complete was about 1 minute and 4 seconds.

Testing Rekall
I tested Rekall in the same way. First I started the Rekall pool client with 100 threads and waited a few minutes until the Startup process was complete. The agent worker was not running at this point. I then created a new hunt:
Note that as soon as this hunt was posted all clients immediately executed the hunt (Rekall does not use a client scheduling rate limit for hunts). I waited a short time and then ran the agent worker to process the hunt status batch job. Rekall's hunt processing essentially just maintains metadata about the hunt (e.g. how many clients participated), but it is not critical to completing the hunt. The hunt is in fact complete as soon as the clients have uploaded their results.

However, running the processor by hand demonstrates the worst case performance - all 100 hunt notifications are pending and must be processed at once. The processor took 4 seconds to process all the hunt notifications (it should be noted that all network traffic goes to the cloud, so network latency is included in these times).

I then used Rekall's inspect_hunt plugin to plot the graph of client recruitment (similar to GRR's UI above):

As can be seen in the above graph, all clients completed their hunt in a little under 5 seconds (which was their maximum poll time). This makes sense since there is no server side code running to introduce client processing latency, so each client is operating independently from other clients.

Discussion

Although the Rekall Agent is not yet as featureful as GRR, it already demonstrates some excellent advancements. In the following I discuss how the current architecture addresses the goals I set out to achieve.

No part of the system should be in the critical path

We define the critical path as the connection between the client and server. Unlike GRR, the Rekall Agent Worker (batch processor) is not absolutely required to run. Of course, we assume that the Google Cloud Storage infrastructure continues to operate as a general service, even under the load we exert on it.

Barring a major outage of a global Google service that is designed to serve millions of customers and has clear SLAs, Rekall Agent clients should never receive any load related errors (e.g. HTTP 500 errors). This in itself is a major improvement over GRR.

Even if the Rekall Agent batch processing jobs do not work, the client's FlowStatus tickets are still stored in the bucket and will be processed as soon as the batch services are restored. In fact, the Rekall Agent UI is aware of this possibility and marks the flow with a (*) to indicate post processing has not completed. The below screenshot was taken after a flow was scheduled for a client, but the batch processor was not running. As soon as the client completes the flow it uploads a Done ticket and the UI notes that the flow is complete but not yet post processed:
Note that in this case the UI indicates the location of the client's ticket (which contains the final status and the list of result collections) so the user can still manually read the results if needed. The FlowStatus batch job simply maintains high level flow related statistics and it is not absolutely essential to the operation of the system.

We have also previously seen that when a hunt is issued all clients immediately respond to the hunt because they all read the same jobs file. Clients are then able to upload their results as soon as they complete executing the flow. No part of the system is in the critical path between client and server and clients should never receive a load related 500 error.

The system should be easier to deploy at any scale.

Whether the Rekall Agent is deployed to service a few clients or many thousands, the deployment procedure is exactly the same. The administrator simply creates a new bucket and prepares a new configuration file pointing the client at the new bucket. Depending on how much post-processing is required, the administrator can increase or decrease the capacity of the batch processor itself (but since Rekall batch processors typically only deal with metadata, not much capacity is required in the first place).

There is no need to deal with load balancing, data stores or high availability setup. The system transparently scales to whatever level is needed. SLAs are managed by the cloud provider.

The system should be able to do what users want from it.

Users sometimes want to run complex operations which produce a lot of data, such as a complete recursive directory listing of the entire system. Users sometimes want to make a timeline with Plaso or Timesketch, or even export data to Elasticsearch. The current GRR architecture does not allow this, but the Rekall Agent does. Consider the following operation which recursively lists every file on my system:

On my machine, running this flow takes around 4 minutes. This is the output with debugging enabled:

The final result collection is a 300MB SQLite file containing around 1.6 million rows of individual file stat() information. However, Rekall stores it in the bucket as a single gzipped file which only takes around 46MB of cloud storage. If a user wishes to export this file to Plaso they may now simply download the file locally and load it into Plaso (or another tool as required).

The Rekall console UI can also use this file to display the client's Virtual File System with the vfs_ls() plugin. This is a simple file/directory navigator similar to that presented by GRR. The Rekall vfs_ls plugin simply queries the result collection directly.

As the debug messages reveal, Rekall queries the bucket to check if the collection is modified, and if it is not (i.e. the HTTP response status is 304 - not modified) it simply uses its local cached copy. This shows that even though the collection is very large, since the file is never updated once it has been written, the Rekall console can remain very responsive (Query served within 300ms after the first download) - simply because it is easy and efficient to cache files:

Note that this type of operation (Creating a full timeline of a remote system) is not currently possible in GRR because GRR is too slow and the load on the database is too large for such a flow to complete in a reasonable time.

The system should be simple.

At the end of the day the Rekall Agent system just deals with simple files. The Rekall Console UI allows one to list and view any file within the bucket. If the file is a collection, the UI just displays the tabular output (which can be filtered in arbitrary ways using an Efilter query). For example, we can simply list all the files inside the client's namespace using the bucket_ls plugin:

We can also just view every one of these files. For example, let's examine the client's metadata:



The user can also just export all the files from the bucket (using any cloud tool) and examine them with the sqlite binary itself - there is no magic in this system, everything really is a file.

The system should avoid copying data needlessly as much as possible.

In Rekall the client agent is responsible for creating and uploading the result collections. The client uploads the collection directly to its final resting place whenever possible. Although the system does have a facility for running post processing on the uploaded collections, the system itself avoids moving or copying the actual bulk data from its upload location. Rather, the Rekall batch processing system maintains metadata about the uploaded collections and just uses it directly.


Note in particular that when exporting data from a hunt, the export plugin will merge the results from each client into the export directory. Unlike GRR, the Rekall worker does not actually look at the results from clients; the results are only merged at export time. This means that post processors do not need to manipulate a lot of data - instead they just maintain metadata about all the result collections.

Rekall Agent Alpha launch


DFRWS 2017 Release - Code named Hurricane Ridge

Last week at DFRWS 2017, we were proud to launch Rekall 1.7.0RC1 with the first alpha release of the Rekall Agent. The Rekall Agent is a distributed end point monitoring solution based on the Google Cloud Platform. The launch also includes a white paper and a demo site hosted at https://dfrws2017-rekall.appspot.com.

Apart from its scalability, enterprise grade access control and auditing, the Rekall Agent brings the capabilities of both Rekall's EFilter query language and OSQuery into a complete distributed endpoint monitoring solution. The overall solution makes it possible to launch a flexible query on many endpoint systems at the same time, and collect their responses quickly and efficiently.

At DFRWS we worked through a hands-on workshop where participants installed the Rekall Agent server in their own Google Cloud project, or alternatively used a shared project for collaborative remote forensic investigations. We then ran through a number of common forensic and incident response scenarios.

The workshop was so successful that people requested the project remain up afterwards, so they could continue playing with the release.

Rekall Agent demo site

We have therefore decided to leave the site up as a demo site. This means anyone on the internet can request administrator rights for the application and quickly see how it works.
Simply navigate to https://dfrws2017-rekall.appspot.com/ and you should see the agreement page. Please note that since this is a shared system (in which anyone is an admin level user), all data can be viewed by anyone else.
Clicking the Make Me Admin button assigns the Administrator and Viewer roles to your username. The Viewer role is needed so you can log in and search for clients. The Administrator role allows you to assign any other roles to your username as needed.
Let's start off by searching for clients (For a small installation such as this, simply search for label:All to list all the clients):

Now to view the flows in each client, simply click on the flows column.
Since you do not currently have approval to access this client, the UI will start the approval request workflow. You can request the Examiner (read only) or Investigator (can launch flows) role on this client. The approver list shows all users with approver rights. If you do not appear in this list you will need to grant yourself the approver role by clicking the USERS menu then Add:
Once the approval request is sent, you can see it in your inbox at the top right of the screen.

Adding new machines to the demo site

You are welcome to add new agent machines to the Rekall demo site, and use the demo site to run Rekall plugins on them. Note that since this is a shared site and you are effectively granting it root access you should only use disposable machines (e.g. VMs). In fact you should probably run the agent as a non-root user.
First, install both the Rekall Agent and Rekall Forensic packages as distributed above.
Then copy the sample configuration file:
And replace the location with the demo site:
You should run the agent as a non-root user by hand from the command line (you may need to change the writeback path to somewhere it can write to):
Now you can launch any plugin against your client. For example, to launch an OSQuery query simply select the "osquery plugin" then enter the query:
After waiting for the Agent to pick up the new query, the flow will be marked as Done:
We can view the collection which is the result of the query:

Feedback


This is the first Alpha release of Rekall Agent - we put it out there to solicit comments, thoughts and discussions. Please mail us at rekall-discuss@googlegroups.com with any suggestions for further improvements.

ELF hacking with Rekall

Imagine you are responding to an incident on an old Linux server. You log into the server, download your favorite Linux incident response tool (e.g. Linpmem, aff4imager), and start collecting evidence. Unfortunately the first thing you see is:
Argh! This system does not have the right library installed! Welcome to Linux's own unique version of DLL hell!
This is, sadly, a common scenario for Linux incident response. Unfortunately we responders can not pick which systems we will investigate. Some of the compromised systems we need to attend to are old (maybe that is why we need to respond to them :). Some are very old! Other systems simply do not have the required libraries installed, and we are rarely allowed to install additional libraries on production machines.
In preparation for the next release of the AFF4 imagers (including Linpmem, Winpmem and OSXPmem), I have been testing the imagers on old server software. I built an Ubuntu 10.04 (Lucid Lynx) system for testing. This distribution is really old and was EOLed in 2015, but I am sure some old servers are still running it.
Such an old distribution helps to illustrate the problem - if we can get our tools to work on such an old system, we should be able to get them to work anywhere. In the process I learned a lot about ELF and how Linux loads executables.
In the rest of this post, I will be using this old system and trying to get the latest aff4imager tool to run on it. Note that I am building and developing the binary on a modern Ubuntu 17.10 system.

Statically compiling the binary.

As shown previously, just building the aff4 imager using the normal ./configure, make install commands will build a dynamically linked binary. The binary depends on a lot of libraries. How many? Let's take a look at the first screenful of dependencies:
It is obviously not reasonable to ship such a tool for incident response. Even on a modern system this will require installing many dependencies. On servers and managed systems it may be against management policy to install so many additional packages.
The first solution I thought of is to simply build a static binary. GCC has the -all-static flag which builds a completely static binary. That sounds like what we want:
-all-static
      If output-file is a program, then do not link it against any shared libraries at all.
Since modern Linux distributions do not ship any static libraries, we need to build every dependency from source statically. This is a mostly tedious exercise, but it is possible to create a build environment with the static libraries installed into a separate prefix directory. Adding the -all-static flag to the final link step will produce a fully static binary.
Excellent! This built a static binary as confirmed by ldd and file. Unfortunately this binary crashes, even on the modern system it was built on. If we use GDB we can confirm that it crashes as soon as any pthread function is called. It seems that statically compiling pthread does not work at all.
If we try to run this binary on the old system, it does not even get to start:
The program aborts immediately with a "kernel too old" message. This was a bit surprising to me, but it seems that static binaries do not typically run on old systems. The reason seems to be that glibc requires a minimum kernel version, and when we statically compile glibc into the binary, it refuses to work on older kernels. Additionally, some functions simply do not work when statically compiled (e.g. pthread). For binaries which do not use threads, an all static build does work on a modern enough kernel, but obviously it does not achieve the intended effect of being able to run on all systems, since it introduces a minimum requirement on the running kernel.

Building tools on an old system

A common workaround is to build the tools on an old system, and perhaps ship multiple versions of the tool. For example, we might build a static binary for Ubuntu 10.04, Ubuntu 10.10 … all the way through to modern Ubuntu 17.10 systems.
Obviously this is a lot of work in setting up a large build environment with all the different OS versions. In practice it is actually very difficult to achieve, if not impossible. For example, the AFF4 library is written using modern C++11. The GCC versions shipped on systems as old as Ubuntu 10.04 do not support C++11 (the standard had not even been published when these systems were released!). Since GCC is intimately connected with glibc and libstdc++, it may not actually be possible to install a modern gcc without also upgrading glibc (and thereby losing compatibility with the old glibc, which is the whole point of the exercise!).

Building a mostly static binary

Ok, back to the drawing board. In the previous section we learned that we can not statically link glibc into the binary because glibc is tied to the kernel version. So we would like to still link dynamically against the local glibc, but not much else. Ideally we want to statically compile all the libraries which may or may not be present on the system, so we do not need to rely on them being installed. We use the same static build environment as described above (so we have static versions of all dependent libraries).
Instead of specifying the "-all-static" flag as before we can specify the "-static-libgcc" and "-static-libstdc++" flags:
-static-libstdc++
When the g++ program is used to link a C++ program, it will normally automatically link against libstdc++. If libstdc++ is available as a shared library, and the -static option is not used, then this will link against the shared version of libstdc++. That is normally fine. However, it is sometimes useful to freeze the version of libstdc++ used by the program without going all the way to a fully static link. The -static-libstdc++ option directs the g++ driver to link libstdc++ statically, without necessarily linking other libraries statically.
This is looking better. When we build our aff4imager with these flags we get a dynamically linked binary, but it has a very small set of dependencies - all of which are guaranteed to be present on any functional Linux system:
This is looking very promising! The binary works on a freshly installed modern Linux system since it links to libpthread dynamically, but does not require any external libraries to be installed. Let's try to run it on our old system:

Argh! What does this mean?

Symbol versioning

Modern systems bind binaries to specific versions of each library function exported, rather than to the whole library. This is called Symbol Versioning. 
The error reported by the old system indicates that the binary requires a particular version of GLIBC symbols (2.14) which is not available on this server. This puts a limit on the oldest system that we can run this binary on.
Symbol versioning means that every symbol in the symbol table requires a particular version. You can see which version each symbol requires using the readelf program:
Before the binary is loaded, the loader makes sure it has all the libraries and their versions it needs. The readelf program can also report this:

ELF parsing with Rekall

To understand how the ELF file format implements symbol versioning we can use Rekall to implement an ELF parser. The ELF parser is a pretty good example of Rekall's advanced type parser, so I will briefly describe it here. Readers already familiar with Rekall's parsing system can skip to the next section.
The Rekall parser is illustrated in the diagram below.
The Address Space is a Rekall class responsible for representing the raw binary data. Rekall's binary parser interprets the raw binary data through the Struct Parser class. Rekall automatically generates a parser class to decode each struct from three sources of information:

  1. The struct definition is given as a pure JSON structure (this is sometimes called the VTypes language). Typically this is automatically produced from debugging symbols.

  2. The overlay. This Python data structure may introduce callables (lambdas) to dynamically calculate fields, offsets and sizes, overlaid on top of the struct definition.

  3. If more complex interfaces are required, the user can also specify a class with additional methods which will be presented by the Struct Parser class.
The main emphasis of the Rekall binary parser is on readability. One should be able to look at the code and see how the data structures are constructed and related to one another.
Let us examine a concrete example. The elf64_hdr represents the ELF header.
  1. The VType definition describes the struct name, its size, and each known field name, its relative offset within the struct as well as its type. This information is typically generated from debug symbols.
    "elf64_hdr": [64, {
       
'e_ident': [0, ['array', 16, ['uint8_t']]],
       
'e_type': [16, ['uint16_t']],
       
'e_machine': [18, ['uint16_t']],
       
'e_version': [20, ['uint32_t']],
       
'e_entry': [24, ['uint64_t']],
       
'e_phoff': [32, ['uint64_t']],
       
'e_shoff': [40, ['uint64_t']],
       
'e_flags': [48, ['uint32_t']],
       
'e_ehsize': [52, ['uint16_t']],
       
'e_phentsize': [54, ['uint16_t']],
       
'e_phnum': [56, ['uint16_t']],
       
'e_shentsize': [58, ['uint16_t']],
       
'e_shnum': [60, ['uint16_t']],
       
'e_shstrndx': [62, ['uint16_t']],
       
}],
  2. The overlay adds additional information into the struct. Where the overlay has None, the original VType's definition applies. This allows us to support variability in the underlying struct layout (as is common across different software versions) without needing to adjust the overlay. For example, in the below example, the e_type field was originally defined as an integer, but it really is an enumeration (since each value means something else). By defining this field as an enumeration Rekall can print the string associated with the value. The overlay specifies None in place of the field's relative offset to allow this value to be taken from the original VTypes description.
"elf64_hdr": [None, {
       
'e_type': [None, ['Enumeration', {
           
"choices": {
               
0: 'ET_NONE',
               
1: 'ET_REL',
               
2:'ET_EXEC',
               
3:'ET_DYN',
               
4:'ET_CORE',
               
0xff00:'ET_LOPROC',
               
0xffff:'ET_HIPROC'},
           
'target': 'uint8_t'}]],

       
'sections': lambda x: x.cast("Array",
           offset
=x.e_shoff,
           target
='elf64_shdr',
           target_size
=x.e_shentsize,
           count
=x.e_shnum),
       
}],
     Similarly, we can add additional "pseudo" fields to the struct. When the user accesses a field called "sections", Rekall will run the callable, allowing it to create and return a new object. In this example, we return the list of ELF sections - which is just an Array of elf64_shdr structs starting at offset e_shoff.
  3. While overlays make it very convenient to quickly add small callables as lambdas, sometimes we need more complex methods to be attached to the resulting class. We can then define a proper Python class to support arbitrary functionality.
class elf64_hdr(obj.Struct):

    def section_by_name(self, name):
        for section in self.sections:
            if section.name == name:
                return section
    In this example we add the section_by_name() method to allow retrieving a section by its name.
We can now parse an ELF file. Simply instantiate an address space using the file and then overlay the header at the start of the file (offset 0). We can then use methods attached to the header object to navigate the file - in this case, to print all the section headers.
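
For readers without a Rekall checkout handy, the same header fields can be pulled out with nothing more than the standard struct module, using the offsets from the VType definition above. This is a plain-Python illustration, not Rekall's actual parser:

import struct

def read_elf64_header(path):
    """Decode the handful of elf64_hdr fields used in this post."""
    with open(path, "rb") as fd:
        data = fd.read(64)  # elf64_hdr is 64 bytes long.
    assert data[:4] == b"\x7fELF", "not an ELF file"
    return {
        "e_type":      struct.unpack_from("<H", data, 16)[0],
        "e_shoff":     struct.unpack_from("<Q", data, 40)[0],
        "e_shentsize": struct.unpack_from("<H", data, 58)[0],
        "e_shnum":     struct.unpack_from("<H", data, 60)[0],
        "e_shstrndx":  struct.unpack_from("<H", data, 62)[0],
    }

print(read_elf64_header("/bin/ls"))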

Symbol Versioning in ELF files

So how is symbol versioning implemented in ELF? There are two new sections defined: The ".gnu.version" (SHT_GNU_versym) and the ".gnu.version_r" (SHT_GNU_verneed).
Let's examine the SHT_GNU_verneed section first. Its name is always ".gnu.version_r" and it consists of two different structs: the elf64_verneed struct forms a linked list essentially naming the library files needed (e.g. libc.so.6), and each elf64_verneed struct is in turn linked to a second list of elf64_vernaux structs naming the versions needed from that library.
This arrangement can be thought of as a quick index of all the requirements of all the symbols. The linker can just quickly look to see if it has all the required files and their symbols before it even needs to examine each symbol. The most important data is the "other" number which is a unique number referring to a specific filename + version pair.
The next important section is the SHT_GNU_versym section, which always has the name ".gnu.version". This section holds an array of "other" numbers which lines up with the binary's symbol table. This way each symbol may have an "other" number in its respective slot.
The "other" number in the versym array matches with the same "other" number in the verneed structure.
We can write a couple of Rekall plugins to just print this data. Let's examine the verneed structure using the elf_versions_needed plugin:
Note the "other" number for each filename + version string combination.
Let's now print the version of each symbol using the elf_versions_symbols plugin.

Fixing the Symbol Versioning problem

Ok, so how do we use this to fix the original issue? While symbol versioning has its place, it prevents newer binaries from running on old systems (which do not have the required version of some functions). Often, however, the old system has some (older but working) version of the required function.
If we can trick the system into loading the version it has instead of the version the binary requires, then the binary can be linked at runtime and would work. It will still invoke the old (possibly buggy) version of these functions, but that is what every other software on the system will be using anyway.
Take for example the memcpy function, which on the old system is available as memcpy@2.2.5. This function was replaced with a more optimized memcpy@2.14 which new binaries require, breaking backwards compatibility. Old binaries running on the old system still use (the possibly slower) memcpy@2.2.5, so if we make our binary use the old version instead of the new version it would be no worse than existing old binaries (certainly if we rebuilt our binary on the old system it would be using that same version anyway). We don't actually need these optimizations, so it should be safe to just remove the versioning in most cases.
Rekall also provides write support. Normally Rekall operates off a read only image, so write support is not useful, but if Rekall operates on writable media (like live memory) and write support is enabled, then Rekall can modify the struct. In Python this works by simply assigning to the struct field.
So the goal is to modify the verneed index so it refers to acceptably old enough versions. The verneed section is an index of all the symbol versions and the loader checks these before even loading the symbol table.
We can then also remove the "other" number in the versym array to remove versioning for each symbol. The tool was developed and uploaded to the contrib directory of the Rekall distribution. As can be seen, after the modifications the needed versions are all reset to very early versions.
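
For illustration, the versym half of that patching step can be sketched with plain file I/O instead of Rekall's write support. The offset and size arguments are placeholders, and writing VER_NDX_GLOBAL (1) everywhere is my assumption of a safe "unversioned" value; the real contrib tool also rewrites the verneed index as described above:

import struct

def strip_versym(path, sh_offset, sh_size):
    """Overwrite every entry of the .gnu.version section with VER_NDX_GLOBAL (1)."""
    count = sh_size // 2
    with open(path, "r+b") as fd:
        fd.seek(sh_offset)
        # 1 == VER_NDX_GLOBAL: bind the symbol without any version requirement.
        fd.write(struct.pack("<%dH" % count, *([1] * count)))

# Placeholder offset and size for the .gnu.version section of the target binary.
strip_versym("./aff4imager", 0x4a0, 0x100)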

One last twist

Finally we try to run the binary on the old system and discover that the binary is trying to link a function called clock_gettime@GLIBC_2.17. That function simply does not exist on the old system and so fails to link (even when symbol versioning is removed).
At this point we can either:
  1. Map the function to a different function
  2. Implement a simple wrapper around the function.
By implementing a simple wrapper around the missing function and statically linking it into the binary, we can be assured that our binary will only call functions which are already available on the old system. In this case we can implement clock_gettime() as a wrapper around the old gettimeofday().
int clock_gettime(clockid_t clk_id, struct timespec *tp) {
    struct timeval tv;
    clk_id++;

    int res = gettimeofday(&tv, NULL);
    if (res == 0) {
        tp->tv_sec = tv.tv_sec;
        tp->tv_nsec = tv.tv_usec / 1000;
    }
    return res;
}

The moment of truth

By removing symbol versioning we can force our statically built imager to run on the old system. Although it is not often that one may need to respond to an old system, when the need arises it is imperative to have ready to use and tested tools.

Other tools

I wanted to see if this technique was useful for other tools. One of my other favourite IR tools is OSQuery which is distributed as a mostly statically built binary. I downloaded the pre-built Linux binary and checked its dependencies.
There are a few more dependencies but nothing which can not be found on the system.
Again the binary requires a newer version of GLIBC and so will not run on the old system:
I applied the same Rekall script to remove symbol versioning. However, this time, the binary still fails to run because the function pthread_setname_np is not available on this system.
We could implement a wrapper function like before, but this would require rebuilding the binary, which takes some time. Reading up on the pthread_setname_np function, it does not seem important (it just sets the name of the thread for debugging). Therefore we can just re-route this function to some other safe function which is available on the system.
Re-routing functions can be easily done by literally loading the binary in a hex editor and replacing "pthread_setname_np" with "close" inside the symbol table[1].

References


[1] This is generally safe because calling conventions on x64 rely on the caller to manage the stack, and close() will just interpret the thread id as a filehandle which will be invalid. Of course extensive tool testing should be done to reveal any potential problems.

Virtual Secure Mode and memory acquisition

This blog post is about my recent work to track down and fix a bug in Winpmem - the open source memory acquisition utility. You can try the latest release at the Velocidex/c-aff4 release page (Note Winpmem moved from the Rekall project into the AFF4 project recently).

The story starts, as many do, with an email sent to the SANS DFIR list asking about memory acquisition tools to use on VSM protected laptops. There are a number of choices out there these days, but some readers proposed the old favorite: WinPmem. WinPmem is AFAIK the only open source memory acquisition tool. It was released quite some years ago and had not been updated for a while. However, it has been noted that it crashes on Windows systems with VSM enabled.

In 2017 Jason Hale gave an excellent talk at DFRWS where he put a number of memory acquisition tools through their paces and found that most tools blue screened when run on VSM enabled systems (there is also a blog post). Subsequently some of these tools were fixed, but I could not find any information as to how they were fixed. The current state of the world is that Winpmem appears to be one of those tools that crashes.

When I heard these reports I finally blew the dust off my Windows machine and enabled VSM. I then tried to acquire with the released winpmem (v3.0.rc3) and I can definitely confirm it BSODs right away! Not good at all!

At first I thought that the physical memory range detection was the problem. Maybe under VSM the memory ranges we detect with WinPmem are wrong and this makes us read into invalid physical memory ranges? Most tools get the physical memory ranges using the undocumented kernel function MmGetPhysicalMemoryRanges() - which is only accessible in kernel mode.

To confirm, I got winpmem to just print out the ranges it got from that function using the -L flag. This flag makes winpmem just load the driver, report on the memory ranges and quit without reading any memory (so it should not crash).


These memory ranges look a bit weird actually - they are very different from what the same machine reports without VSM enabled.

The first thing I checked was another information source for the physical memory ranges. I wrote a Velociraptor artifact to collect the physical memory ranges as reported by the hardware manager in the registry (since MmGetPhysicalMemoryRanges() is not accessible from userspace, the system writes the memory ranges into the registry on each boot). Note that you can also use meminfo from Alex Ionescu to dump memory ranges from userspace.


So this seems to agree with what the driver reports.

Finally I attempted to take the image with DumpIt - this is a great memory imager which did not crash either:

The DumpIt program produced a windows crash dump image format. It is encouraging to see that the CR3 value reported by DumpIt was consistent with WinPmem's report. While DumpIt does not explicitly report the physical memory ranges it used, since it uses a crashdump format file, we can see the ranges encoded into the crashdump using Rekall:


So at this point I was convinced that the problem was unrelated to reading outside valid physical memory ranges - even DumpIt uses the same ranges as WinPmem but it does not crash!

Clearly any bug fix will involve updating the driver components (since every imager's userspace component just reads from the same ranges anyway). It had been years since I did any kernel development, and my code signing certificate expired many years ago. These days code signing certs are much more expensive since they require EV certs. I was concerned that unsuspecting users would attempt to use WinPmem on VSM enabled systems (which are becoming more common) and cause a BSOD. This poses an unacceptable risk.

I updated the DFIR list announcing that in my estimation we would need to fix the issue in the kernel driver, and since my code signing cert had expired I would not be able to do so. I felt the only responsible thing was to pull the download and advise users to consider an alternative product - at least for the time being.

What happened next was an amazing response from the DFIR community. I received many emails generously offering to sponsor a code signing cert and people expressing regret over the possibility of Winpmem winding down. Finally Emre Tinaztepe from Binalyze graciously offered to sign the new drivers and contribute them to the community. Thanks Emre! It is really awesome to see open source at work and the community coming together!

Fixing the bug
So I set out to see why Winpmem crashes. To do kernel development, one needs to install Visual Studio with the Driver Development Kit (DDK). This takes a long time and downloads several GB worth of software!

After some time I had the development environment set up, checked out the winpmem code and tried to replicate the crash with the newly built driver. To my surprise it did not crash and worked perfectly! At this point I was confused….

It turns out that the bug was fixed back in 2016 but I had totally forgotten about it (a touch of senility no doubt). I then forensically analyzed my email to discover a conversation with Jason Hale back in October 2016! Unfortunately my code signing cert ran out in August 2016, and so although the fix was checked in, I never rebuilt the driver for release - my bad :-(.

Taking a closer look at the bug fix, the key is this code:
  char *source = extension->pte_mmapper->rogue_page.value + page_offset;
  try {
    // Be extra careful here to not produce a BSOD. We would rather
    // return a page of zeros than BSOD.
    RtlCopyMemory(buf, source, to_read);
  } except(EXCEPTION_EXECUTE_HANDLER) {
    WinDbgPrint("Unable to read from %p", source);
    RtlZeroMemory(buf, to_read);
  }

The code attempts to read the mapped page, but if a segfault occurs, rather than allow it to become a BSOD, the code traps the error and pads the page with zeros. At the time I thought this was a bit of a cop out - we don't really know why it crashes, but if we are going to crash, we would rather just zero pad the page than BSOD. Why would reading a page from physical RAM backed memory segfault?

To understand this we must understand what Virtualization Based Security is. Microsoft Windows implements Virtual Secure Mode (VSM):

VSM leverages the on chip virtualization extensions of the CPU to sequester critical processes and their memory against tampering from malicious entities. … The protections are hardware assisted, since the hypervisor is requesting the hardware treat those memory pages differently.

What this means in practice is that some pages belonging to sensitive processes are actually unreadable from within the normal OS. Attempting to read these pages actually generates a segfault by design! Of course in the normal course of execution, the VSM container would never access those physical pages because they do not belong to it, but the hypervisor is actually implementing an additional hardware based check to make sure the pages can not be accessed. The above fix works because the segfault is caught and WinPmem just moves on to the next page, avoiding the BSOD.

So I modified Winpmem to report all the PFNs that it was unable to read. Winpmem now prints a list of all unreadable pages at the end of acquisition:

You can see that there are not many pages protected by the hypervisor (around 25MB in total). The pages are not contained in a specific region though - they are sprayed around all of physical memory. This actually makes perfect sense, but let's double check.

I ran the kernel debugger (windbg) to see if it can read the physical pages that WinPmem can not read. Windbg has the !db command which reads physical memory.


Compare this output to the previous screenshot, which indicates Winpmem failed to read PFNs 0x2C7 and 0x2C9 (but it could read 0x2C8). Windbg is exactly the same - it too can not read those pages, presumably because they are protected by VSM.

Representing unreadable pages
This is an interesting development actually. Disk images have for a long time had the possibility of a sector read error, and disk image formats have evolved to account for such a possibility. Sometimes when a sector is unreadable we can zero pad it, but most forensic disk image formats have a way of indicating that the sector was actually not readable. This never happened before in memory images, because it was inconceivable that we could not read from RAM (as long as we had the right access!).

Winpmem uses the AFF4 format by default, and luckily AFF4 was designed specifically with forensic imaging in mind (see section 4.4, "Symbolic stream aff4:UnreadableData"). Currently the AFF4 library fills an unreadable page with the string "UNREADABLE" to indicate that the page was protected.
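As a rough illustration of the idea (a sketch, not the actual AFF4 library code), an imager could mark a page that faulted by filling it with a repeating marker pattern instead of silent zeros:

  // Sketch only: fill an unreadable page with a repeating "UNREADABLE"
  // pattern so it cannot be confused with a genuinely all-zero page.
  static void fill_unreadable_marker(unsigned char *page, size_t page_size) {
    static const char marker[] = "UNREADABLE";
    for (size_t i = 0; i < page_size; i++) {
      page[i] = marker[i % (sizeof(marker) - 1)];
    }
  }

As I understand the format, the image's map can also point such a range at the symbolic aff4:UnreadableData stream, so a reader can tell "unreadable" apart from "zero" programmatically rather than by pattern matching.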

Note that if the imaging format does not support this (e.g. RAW images or crashdumps), then we can only zero-pad unreadable pages, and we have no way of knowing afterwards whether those pages were originally zero or simply unreadable.

PTE Remapping vs. MmMapIoSpace()

Those who have been following winpmem for a while are probably aware that it uses a technique called PTE remapping by default. All memory acquisition tools have to map physical memory into the kernel's virtual address space - after all, software can only ever read the virtual address space and cannot access physical memory directly.

There are typically 4 techniques for mapping physical memory to virtual memory:
  1. Use of the \\.\PhysicalMemory device - this is an old way of exporting physical memory to userspace and has been locked down for ages by Windows.
  2. Use of the undocumented MmMapMemoryDumpMdl() function to map arbitrary physical pages into kernel space.
  3. Use of the MmMapIoSpace() function. This function is typically used to map PCI devices' DMA buffers into kernel space.
  4. Direct PTE remapping - the technique WinPmem uses by default (sketched just below).
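To make the fourth technique concrete, here is a heavily simplified sketch of the idea behind direct PTE remapping on x86-64. The names (rogue_va, rogue_pte) and the fixed bit layout are illustrative assumptions, not WinPmem's actual code:

  // Sketch only: repoint the PTE of a driver-owned "rogue" page at an
  // arbitrary PFN, then flush the stale translation. Reading rogue_va now
  // reads that physical page through the normal virtual address space.
  #define PTE_PFN_MASK 0x000FFFFFFFFFF000ULL

  static void map_pfn_for_reading(volatile UINT64 *rogue_pte,
                                  void *rogue_va, UINT64 pfn) {
    UINT64 pte = *rogue_pte;
    pte &= ~PTE_PFN_MASK;                 // keep present/rw/cache flag bits
    pte |= (pfn << 12) & PTE_PFN_MASK;    // swap in the target page frame
    *rogue_pte = pte;
    __invlpg(rogue_va);                   // invalidate the old TLB entry
  }

The guarded RtlCopyMemory shown earlier then reads through rogue_va one page at a time, catching the fault if the hypervisor refuses access.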
Memory acquisition is by definition an incorrect operation when viewed from the point of view of OS stability. For example, running Winpmem's MmMapIoSpace() mode under the driver verifier will produce a bugcheck 0x1a with code 0x1233 "A driver tried to map a physical memory page that was not locked. This is illegal because the contents or attributes of the page can change at any time."

This is technically correct - we are mapping some random pages of system physical memory into our virtual address space, but we don't actually own these pages (i.e. we never locked them), so anyone could, at any time, use them out from under us. Obviously, if we are writing a well-behaved kernel driver, then mapping physical memory we do not own is by definition bad! Yet this is exactly what memory acquisition is all about!

The other interesting issue is caching attributes. Most people's gut reaction is that direct PTE manipulation should be less reliable, since it goes behind the OS's back and maps physical pages directly, ignoring memory caching attributes. In reality, the OS performs a lot of extra checks that are designed to trap erroneous use of its APIs. For example, it may fail an MmMapIoSpace() request if the cache attributes are incorrect. That makes sense if we actually intend to read and write the memory and depend on cache coherence, but if we are just acquiring it, it doesn't matter - we would much rather get the data than worry about caching. Since the image will contain a lot of smear anyway, caching does not affect imaging in any measurable way.
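For comparison, here is a minimal sketch of the MmMapIoSpace() approach (buffer handling and error reporting omitted; the MmNonCached choice here is just an assumption, not necessarily what WinPmem requests):

  // Sketch only: map one physical page, copy it out, unmap it again.
  PHYSICAL_ADDRESS phys;
  phys.QuadPart = (LONGLONG)(pfn << 12);
  void *va = MmMapIoSpace(phys, PAGE_SIZE, MmNonCached);
  if (va != NULL) {
    RtlCopyMemory(buf, va, PAGE_SIZE);
    MmUnmapIoSpace(va, PAGE_SIZE);
  } else {
    // The OS refused the mapping (e.g. conflicting cache attributes);
    // record the PFN as unreadable rather than aborting the acquisition.
    RtlZeroMemory(buf, PAGE_SIZE);
  }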

Winpmem implements both PTE remapping and MmMapIoSpace() modes, but the MmMapIoSpace() mode typically fails on more pages. Here is the output from the MmMapIoSpace() method and the PTE remapping method side by side:
Total unreadable pages with PTE Remapping
Total unreadable pages with the MmMapIoSpace() function

As can be seen, the MmMapIoSpace() method fails to read more pages, and the pages it fails to read are in fact still accessible, as the kernel debugger proves.

Final notes
This post just summarises my understanding of how memory acquisition interacts with VSM. I may have got it completely wrong! If you think so, please comment below or on the DFIR/Rekall/Winpmem mailing list. It makes sense to me that WinPmem cannot read protected pages, given that even the kernel debugger cannot read those same pages. However, I checked the image that DumpIt produced, and it actually contains data (i.e. not all zeros) in the same pages that are supposed to be unreadable. The other interesting thing is that DumpIt reports no errors reading pages (see the screenshot above), so it claims to have read all pages, including those protected by VSM. I don't know whether DumpIt implements some other novel way to read through the hypervisor restrictions, or whether it has a bug such as an uninitialized buffer so these failed pages return random junk - or maybe I have a deep misunderstanding of the whole thing (most likely). More tool testing is required!

Thanks!

I would like to extend my gratitude to Emre Tinaztepe from Binalyze for his patient testing and for signing the drivers. Also thanks to Matt Suiche for the feedback, for doing a lot of excellent original research, and for producing such cool tools.
