Surprising Ctags Behaviour

Published on 2021-01-10 to joshleeb's blog

Today I came across some interesting behaviour implemented by ctags (specifically Universal Ctags) with the way some tag entries are handled. At first I was surprised. Then thinking about a bit more I can understand why. But still, I feel there must be a better way.

Let’s take a look at some Rust code.

struct Foo;

impl Debug for Foo {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        ...
    }
}

struct Bar;

impl Debug for Bar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        ...
    }
}

Looks pretty straight forward. We have two structs Foo and Bar, both of which implement the Debug trait. Now let’s see what happens when we run ctags.

$ ctags --fields=+K lib.rs
Bar	lib.rs	/^impl Debug for Bar {$/;"	implementation
Bar	lib.rs	/^struct Bar;$/;"	struct
Foo	lib.rs	/^impl Debug for Foo {$/;"	implementation
Foo	lib.rs	/^struct Foo;$/;"	struct
fmt	lib.rs	/^    fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;"	method	implementation:Bar
fmt	lib.rs	/^    fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;"	method	implementation:Foo

All appears normal. We see our two structs, implementations, and methods appearing as expected.

The --fields=+K flag doesn’t do anything unexpected either. It simply changes the output to emit the long symbol kind struct rather than the shorter s.

Now let’s extend lib.rs with some more code.

impl Display for Bar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        ...
    }
}

What do you expect to see in the output from ctags? I can tell you that I was expecting to see an additional implementation and an additional method reflected in the generated tags.

$ ctags --fields=+K lib.rs
Bar	lib.rs	/^impl Debug for Bar {$/;"	implementation
Bar	lib.rs	/^impl Display for Bar {$/;"	implementation
Bar	lib.rs	/^struct Bar;$/;"	struct
Foo	lib.rs	/^impl Debug for Foo {$/;"	implementation
Foo	lib.rs	/^struct Foo;$/;"	struct
fmt	lib.rs	/^    fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;"	method	implementation:Bar
fmt	lib.rs	/^    fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;"	method	implementation:Foo

There we see the three implementations, but only two fmt methods.

When I saw this, my first thought was to try and figure out which of the two fmt methods implemented for Bar were being picked up, and which one was being skipped by ctags. To do this, we can set the --fields flag to include line numbers.

$ ctags --fields=+Kn lib.rs
Bar	lib.rs	/^impl Debug for Bar {$/;"	implementation	line:11
Bar	lib.rs	/^impl Display for Bar {$/;"	implementation	line:17
Bar	lib.rs	/^struct Bar;$/;"	struct	line:9
Foo	lib.rs	/^impl Debug for Foo {$/;"	implementation	line:3
Foo	lib.rs	/^struct Foo;$/;"	struct	line:1
fmt	lib.rs	/^    fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;"	method	line:12	implementation:Bar
fmt	lib.rs	/^    fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;"	method	line:18	implementation:Bar
fmt	lib.rs	/^    fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;"	method	line:4	implementation:Foo

Alright, so now we are getting all three implementations, and all three fmt methods.

Looking at the two method lines for implementation:Bar we can see the only difference is the line number. So it appears that ctags will deduplicate tag entries that will appear the same when output, even if each tag entry refers to a distinct symbol.

Checking the man page, I didn’t find this behaviour documented. But there was something useful for the --excmd flag. Here we see that running ctags --excmd=number will “Use only line numbers in the tag file for locating tags.” And since our second run of ctags included line numbers there seems to be some relevance.

Reading on we see some advantages listed for including line numbers. Namely the fourth point.

Retains separate entries in the tag file for lines which are identical in content. In pattern mode, duplicate entries are dropped because the search patterns they generate are identical, making the duplicate entries useless.

On the surface this seems reasonable. But consider the implications for an editor (or plugin) that reads the generated tags and allows the user to jump to a specific symbol in the codebase. The tag file may seem like a set of all symbols in the codebase, but treating it as such may skip over some symbols.

Instead, what we have is a file containing the patterns of all symbols that ctags picked up. So the editor should, for each file, iterate over each pattern ctags captures for that file, and grep for the symbols to then display to the user. And, naively, this would have to happen each time the user wanted to jump to another symbol in the codebase since files can change all the time.

Alright, so that seems like a non-obvious corner case for someone wanting to work with ctags to have to think about. There is also a lot of unnecessary work the editor has to do each time we want to jump to a symbol. But we know we can have ctags output the set of all symbols by including line numbers so let’s consider that approach.

When using line numbers, rather than patterns, jumping to a symbol becomes trivial to implement and much more efficient. We could even just fuzzy match on the tags file itself to select a particular symbol, and then jump to the file and the line included in the tag entry. But, this does have a pretty significant drawback (also referenced in the man page).

changes to the source files can cause the line numbers recorded in the tag file to no longer correspond to the lines in the source file, causing jumps to some tags to miss the target definition by one or more lines. Basically, this option is best used when the source code to which it is applied is not subject to change.

Clearly then, this approach isn’t great either. Unless…

ctags itself isn’t particularly slow, but it isn’t blazingly fast either. And given that it runs entirely on a single thread, and symbol indexing is an embarrassingly parallel problem, there seems to be an opportunity for a significant speed up.

So, if we index symbols by file path and line number, and are able to reindex fast enough that it becomes reasonable to do so on each file change, then we can avoid the drawbacks mentioned in the man page. Unfortunately though, this doesn’t get rid of another (very similar) drawback with indexing by line number.

An interesting benefit of indexing with patterns is that even for unsaved buffers, the editor will be able to jump to the correct location for a symbol. And only symbols added prior to saving the file won’t be picked up. But unless the file is saved, with our new approach we won’t be aware of any changes and so the line numbers are likely to get out of sync.

It seems improving on this will require some more thought. Perhaps an API akin to the DidOpenTextDocument Notification in the Language Server Protocol spec…