Published on to joshleeb's blog
Today I came across some interesting behaviour implemented by ctags
(specifically Universal Ctags) with the way some tag entries are
handled. At first I was surprised. Then thinking about a bit more I can
understand why. But still, I feel there must be a better way.
Let’s take a look at some Rust code.
struct Foo;
impl Debug for Foo {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
...
}
}
struct Bar;
impl Debug for Bar {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
...
}
}
Looks pretty straight forward. We have two structs Foo
and Bar
, both of
which implement the Debug
trait. Now let’s see what happens when we run
ctags
.
$ ctags --fields=+K lib.rs
Bar lib.rs /^impl Debug for Bar {$/;" implementation
Bar lib.rs /^struct Bar;$/;" struct
Foo lib.rs /^impl Debug for Foo {$/;" implementation
Foo lib.rs /^struct Foo;$/;" struct
fmt lib.rs /^ fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;" method implementation:Bar
fmt lib.rs /^ fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;" method implementation:Foo
All appears normal. We see our two structs, implementations, and methods appearing as expected.
The --fields=+K
flag doesn’t do anything unexpected either. It simply
changes the output to emit the long symbol kind struct
rather than the
shorter s
.
Now let’s extend lib.rs
with some more code.
impl Display for Bar {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
...
}
}
What do you expect to see in the output from ctags
? I can tell you that I
was expecting to see an additional implementation and an additional method
reflected in the generated tags.
$ ctags --fields=+K lib.rs
Bar lib.rs /^impl Debug for Bar {$/;" implementation
Bar lib.rs /^impl Display for Bar {$/;" implementation
Bar lib.rs /^struct Bar;$/;" struct
Foo lib.rs /^impl Debug for Foo {$/;" implementation
Foo lib.rs /^struct Foo;$/;" struct
fmt lib.rs /^ fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;" method implementation:Bar
fmt lib.rs /^ fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;" method implementation:Foo
There we see the three implementations, but only two fmt
methods.
When I saw this, my first thought was to try and figure out which of the two
fmt
methods implemented for Bar
were being picked up, and which one was
being skipped by ctags
. To do this, we can set the --fields
flag to
include line numbers.
$ ctags --fields=+Kn lib.rs
Bar lib.rs /^impl Debug for Bar {$/;" implementation line:11
Bar lib.rs /^impl Display for Bar {$/;" implementation line:17
Bar lib.rs /^struct Bar;$/;" struct line:9
Foo lib.rs /^impl Debug for Foo {$/;" implementation line:3
Foo lib.rs /^struct Foo;$/;" struct line:1
fmt lib.rs /^ fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;" method line:12 implementation:Bar
fmt lib.rs /^ fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;" method line:18 implementation:Bar
fmt lib.rs /^ fn fmt(&self, &mut fmt::Formatter<'_>) -> fmt::Result {$/;" method line:4 implementation:Foo
Alright, so now we are getting all three implementations, and all three fmt
methods.
Looking at the two method
lines for implementation:Bar
we can see the only
difference is the line number. So it appears that ctags
will deduplicate tag
entries that will appear the same when output, even if each tag entry refers
to a distinct symbol.
Checking the man page, I didn’t find this behaviour documented.
But there was something useful for the --excmd
flag. Here we see that
running ctags --excmd=number
will “Use only line numbers in the tag file
for locating tags.” And since our second run of ctags
included line numbers
there seems to be some relevance.
Reading on we see some advantages listed for including line numbers. Namely the fourth point.
Retains separate entries in the tag file for lines which are identical in content. In pattern mode, duplicate entries are dropped because the search patterns they generate are identical, making the duplicate entries useless.
On the surface this seems reasonable. But consider the implications for an editor (or plugin) that reads the generated tags and allows the user to jump to a specific symbol in the codebase. The tag file may seem like a set of all symbols in the codebase, but treating it as such may skip over some symbols.
Instead, what we have is a file containing the patterns of all symbols that ctags picked up. So the editor should, for each file, iterate over each pattern ctags captures for that file, and grep for the symbols to then display to the user. And, naively, this would have to happen each time the user wanted to jump to another symbol in the codebase since files can change all the time.
Alright, so that seems like a non-obvious corner case for someone wanting to
work with ctags
to have to think about. There is also a lot of unnecessary
work the editor has to do each time we want to jump to a symbol. But we know
we can have ctags
output the set of all symbols by including line numbers so
let’s consider that approach.
When using line numbers, rather than patterns, jumping to a symbol becomes trivial to implement and much more efficient. We could even just fuzzy match on the tags file itself to select a particular symbol, and then jump to the file and the line included in the tag entry. But, this does have a pretty significant drawback (also referenced in the man page).
changes to the source files can cause the line numbers recorded in the tag file to no longer correspond to the lines in the source file, causing jumps to some tags to miss the target definition by one or more lines. Basically, this option is best used when the source code to which it is applied is not subject to change.
Clearly then, this approach isn’t great either. Unless…
ctags
itself isn’t particularly slow, but it isn’t blazingly fast either.
And given that it runs entirely on a single thread, and
symbol indexing is an embarrassingly parallel problem, there seems to be an
opportunity for a significant speed up.
So, if we index symbols by file path and line number, and are able to reindex fast enough that it becomes reasonable to do so on each file change, then we can avoid the drawbacks mentioned in the man page. Unfortunately though, this doesn’t get rid of another (very similar) drawback with indexing by line number.
An interesting benefit of indexing with patterns is that even for unsaved buffers, the editor will be able to jump to the correct location for a symbol. And only symbols added prior to saving the file won’t be picked up. But unless the file is saved, with our new approach we won’t be aware of any changes and so the line numbers are likely to get out of sync.
It seems improving on this will require some more thought. Perhaps an API akin to the DidOpenTextDocument Notification in the Language Server Protocol spec…