syzbot |
| ID | Workflow | Result | Correct | Bug | Created | Started | Finished | Revision | Error |
|---|---|---|---|---|---|---|---|---|---|
| 0d8f4d24-a8bb-45ea-9eb0-dbb1648db2bc | assessment-kcsan | Benign: ✅ Confident: ✅ | ❓ | KCSAN: data-race in fsnotify_destroy_mark / inotify_handle_inode_event (9) | 2026/03/26 00:03 | 2026/03/26 00:03 | 2026/03/26 00:27 | c6143aacd8e1a01f2482ab5eec709f9949710cee | |
The data race occurs on the `flags` field of `struct fsnotify_mark`.
Specifically, it is a race between a write to clear the
`FSNOTIFY_MARK_FLAG_ALIVE` bit (bit 0) in `fsnotify_free_mark()` and a read of
the `FSNOTIFY_MARK_FLAG_IN_ONESHOT` bit (bit 5) in
`inotify_handle_inode_event()`.
### Analysis of the Race
1. **Write Access**: In `fsnotify_free_mark()`, the code clears the
`FSNOTIFY_MARK_FLAG_ALIVE` flag under the protection of `mark->lock`:
```c
spin_lock(&mark->lock);
...
mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE;
spin_unlock(&mark->lock);
```
This is called during mark destruction (e.g., via `inotify_rm_watch()`).
2. **Read Access**: In `inotify_handle_inode_event()`, the code checks if the
mark is a "oneshot" mark to decide whether to destroy it after the event is
handled:
```c
if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
	fsnotify_destroy_mark(inode_mark, group);
```
This read is performed without holding `mark->lock`.
### Why the Race is Benign
* **Different Bits**: The bits being accessed are different (`0x0001` for
`ALIVE` vs `0x0020` for `IN_ONESHOT`). A naturally aligned 32-bit
`unsigned int` is loaded and stored as a single word, so a write that clears
one bit cannot cause a concurrent read of a different bit in the same word to
return an incorrect value, provided the compiler does not perform unusual
optimizations.
* **Flag Stability**: The `FSNOTIFY_MARK_FLAG_IN_ONESHOT` flag is set when the
inotify watch is created and is never cleared during the lifetime of the mark.
Therefore, the read in `inotify_handle_inode_event()` will always see the
correct value for this bit, regardless of whether `FSNOTIFY_MARK_FLAG_ALIVE` is
being cleared simultaneously.
* **Idempotency of Destruction**: If the race were to cause
`inotify_handle_inode_event()` to call `fsnotify_destroy_mark()` while another
thread is already destroying the mark, it is safe. `fsnotify_destroy_mark()` and
its sub-functions (`fsnotify_detach_mark()` and `fsnotify_free_mark()`) contain
internal checks (protected by locks) to ensure that detachment and freeing only
happen once.
* **Object Lifetime**: The `fsnotify_mark` object is kept alive during event
handling by SRCU (and potentially a reference count), so the memory access
itself is safe from use-after-free.
* **No Functional Impact**: As shown in the KCSAN report (`value changed:
0x00000001 -> 0x00000000`), the `IN_ONESHOT` bit was not even set in this
instance. Even if it were, the logic would remain correct as the `IN_ONESHOT`
bit remains stable.
In conclusion, this is a benign data race on a flags field that does not lead to
memory corruption, crashes, or incorrect logic. While using
`READ_ONCE()`/`WRITE_ONCE()` or `data_race()` would satisfy KCSAN, the current
code is functionally correct.

```
==================================================================
BUG: KCSAN: data-race in fsnotify_destroy_mark / inotify_handle_inode_event

write to 0xffff888104c23184 of 4 bytes by task 11506 on cpu 1:
 fsnotify_free_mark fs/notify/mark.c:584 [inline]
 fsnotify_destroy_mark+0xeb/0x150 fs/notify/mark.c:602
 __do_sys_inotify_rm_watch fs/notify/inotify/inotify_user.c:805 [inline]
 __se_sys_inotify_rm_watch+0xe8/0x170 fs/notify/inotify/inotify_user.c:786
 __x64_sys_inotify_rm_watch+0x31/0x40 fs/notify/inotify/inotify_user.c:786
 x64_sys_call+0xc88/0x3020 arch/x86/include/generated/asm/syscalls_64.h:256
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff888104c23184 of 4 bytes by task 3311 on cpu 0:
 inotify_handle_inode_event+0x34c/0x3d0 fs/notify/inotify/inotify_fsnotify.c:132
 fsnotify_handle_inode_event+0x1c0/0x220 fs/notify/fsnotify.c:272
 fsnotify_handle_event fs/notify/fsnotify.c:327 [inline]
 send_to_group fs/notify/fsnotify.c:375 [inline]
 fsnotify+0x15d0/0x1730 fs/notify/fsnotify.c:592
 __fsnotify_parent+0x2f8/0x3a0 fs/notify/fsnotify.c:238
 fsnotify_parent include/linux/fsnotify.h:96 [inline]
 fsnotify_path include/linux/fsnotify.h:113 [inline]
 fsnotify_file include/linux/fsnotify.h:127 [inline]
 fsnotify_close include/linux/fsnotify.h:457 [inline]
 __fput+0x1de/0x650 fs/file_table.c:455
 fput_close_sync+0xad/0x190 fs/file_table.c:574
 __do_sys_close fs/open.c:1509 [inline]
 __se_sys_close fs/open.c:1494 [inline]
 __x64_sys_close+0x55/0xf0 fs/open.c:1494
 x64_sys_call+0x275d/0x3020 arch/x86/include/generated/asm/syscalls_64.h:4
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00000001 -> 0x00000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 3311 Comm: syz-executor Tainted: G W syzkaller #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
==================================================================
```
| Seq | Timestamp | Type | Name | Duration |
|---|---|---|---|---|
| 0/0 | 2026/03/26 00:03 | flow | assessment-kcsan |
24m Results: map[Benign:true Confident:true Explanation: (verbatim copy of the analysis reproduced above)] |
| 1/1 | 2026/03/26 00:03 | action | kernel-checkouter |
0mResults:map[KernelSrc:/app/workdir/cache/src/0c795072baea38aa27acc4e078034f17971ff9aa] |
| 2/1 | 2026/03/26 00:03 | action | kernel-builder |
11mResults:map[KernelObj:/app/workdir/cache/build/9ee5fab87722701db76e3a742fe26d727701e2b2] |
| 3/1 | 2026/03/26 00:15 | action | codesearch-prepare |
9mResults:map[Index:codesearch-index] |
| 4/1 | 2026/03/26 00:25 | agent | expert |
2mModel:gemini-3-flash-preview Results: map[Benign:true Confident:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel data race is benign or not. The data race report is from KCSAN tool. It contains 2 stack traces of the memory accesses that constitute a data race. A "benign" data races are on a simple int/bool variable or similar field, and the accesses are not supposed to be protected by any mutual exclusion primitives. Common examples of such "benign" data races are accesses to various flags fields, statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A non-benign (or "harmful" data race) can lead to corruption/crash even with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A common example of a "harmful" data races is race on a complex container (list/hashmap/etc), where accesses are supposed to be protected by a mutual exclusion primitive. Also consider races that happen at the same time with the given one. If there is no synchronization in between, other memory accesses in the involved threads race with each other if they access the same memory. For example, if both threads execute: some_struct->have_elements = true; list_add(new_node, &some_struct->list_head); the race on some_struct->have_elements may appear benign, however it also implies there is a race on some_struct->list_head which is not benign, since the list is not thread-safe. Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses, and thus even with a conservative compiler may not be fully atomic. However, such races may still be benign depending on how writes are done, and how read data is used. In the final reply explain why you think the given data race is benign or is harmful. 
Use the provided tools to confirm any assumptions, variables/fields being accessed, etc. In particular, don't make assumptions about the kernel source code, use codesearch tools to read the actual source code. Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool. Prompt: The data race report is: ================================================================== BUG: KCSAN: data-race in fsnotify_destroy_mark / inotify_handle_inode_event write to 0xffff888104c23184 of 4 bytes by task 11506 on cpu 1: fsnotify_free_mark fs/notify/mark.c:584 [inline] fsnotify_destroy_mark+0xeb/0x150 fs/notify/mark.c:602 __do_sys_inotify_rm_watch fs/notify/inotify/inotify_user.c:805 [inline] __se_sys_inotify_rm_watch+0xe8/0x170 fs/notify/inotify/inotify_user.c:786 __x64_sys_inotify_rm_watch+0x31/0x40 fs/notify/inotify/inotify_user.c:786 x64_sys_call+0xc88/0x3020 arch/x86/include/generated/asm/syscalls_64.h:256 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff888104c23184 of 4 bytes by task 3311 on cpu 0: inotify_handle_inode_event+0x34c/0x3d0 fs/notify/inotify/inotify_fsnotify.c:132 fsnotify_handle_inode_event+0x1c0/0x220 fs/notify/fsnotify.c:272 fsnotify_handle_event fs/notify/fsnotify.c:327 [inline] send_to_group fs/notify/fsnotify.c:375 [inline] fsnotify+0x15d0/0x1730 fs/notify/fsnotify.c:592 __fsnotify_parent+0x2f8/0x3a0 fs/notify/fsnotify.c:238 fsnotify_parent include/linux/fsnotify.h:96 [inline] fsnotify_path include/linux/fsnotify.h:113 [inline] fsnotify_file include/linux/fsnotify.h:127 [inline] fsnotify_close include/linux/fsnotify.h:457 [inline] __fput+0x1de/0x650 fs/file_table.c:455 fput_close_sync+0xad/0x190 fs/file_table.c:574 __do_sys_close fs/open.c:1509 [inline] __se_sys_close fs/open.c:1494 [inline] 
__x64_sys_close+0x55/0xf0 fs/open.c:1494 x64_sys_call+0x275d/0x3020 arch/x86/include/generated/asm/syscalls_64.h:4 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x00000001 -> 0x00000000 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 3311 Comm: syz-executor Tainted: G W syzkaller #0 PREEMPT(full) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 ================================================================== Reply: (verbatim copy of the analysis reproduced above) |
| 5/2 | 2026/03/26 00:25 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 4083 output: 93 thoughts: 120 Thoughts: **Analyzing a Data Race in the Filesystem Notification System** Okay, here's what's running through my head. I'm looking at a pretty critical data race, and it seems to be between `fsnotify_free_mark` and `inotify_handle_inode_event`. That immediately tells me this could be impacting file system monitoring, which, if mishandled, could be serious. I need to get to the root of this. The write side of things is coming from `fsnotify_free_mark`, specifically at `fs/notify/mark.c:584`. The read happens in `inotify_handle_inode_event`, at `fs/notify/inotify/inotify_fsnotify.c:132`. This screams "synchronization problem." It looks like the value changed from 1 to 0; this is the key to it. My next move is obvious: I have to dive into the code at those two locations. I need to understand what's being written, what's being read, what the underlying data structure is, and, most critically, the locking mechanisms (or lack thereof) that are supposed to be protecting the data from concurrent access. This is going to be about understanding the data flow and how the kernel handles concurrent access to filesystem metadata. I need to figure out exactly how this '1 to 0' transition could lead to the race condition, and the implications of this value change. |
| 6/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/notify/mark.c IncludeLines:true Name:fsnotify_free_mark] Results: map[SourceCode: 566: /*
567: * Free fsnotify mark. The mark is actually only marked as being freed. The
568: * freeing is actually happening only once last reference to the mark is
569: * dropped from a workqueue which first waits for srcu period end.
570: *
571: * Caller must have a reference to the mark or be protected by
572: * fsnotify_mark_srcu.
573: */
574: void fsnotify_free_mark(struct fsnotify_mark *mark)
575: {
576: struct fsnotify_group *group = mark->group;
577:
578: spin_lock(&mark->lock);
579: /* something else already called this function on this mark */
580: if (!(mark->flags & FSNOTIFY_MARK_FLAG_ALIVE)) {
581: spin_unlock(&mark->lock);
582: return;
583: }
584: mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE;
585: spin_unlock(&mark->lock);
586:
587: /*
588: * Some groups like to know that marks are being freed. This is a
589: * callback to the group function to let it know that this mark
590: * is being freed.
591: */
592: if (group->ops->freeing_mark)
593: group->ops->freeing_mark(mark, group);
594: }
SourceFile:fs/notify/mark.c] |
| 7/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/notify/inotify/inotify_fsnotify.c IncludeLines:true Name:inotify_handle_inode_event] Results: map[SourceCode: 59: int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
60: struct inode *inode, struct inode *dir,
61: const struct qstr *name, u32 cookie)
62: {
63: struct inotify_inode_mark *i_mark;
64: struct inotify_event_info *event;
65: struct fsnotify_event *fsn_event;
66: struct fsnotify_group *group = inode_mark->group;
67: int ret;
68: int len = 0, wd;
69: int alloc_len = sizeof(struct inotify_event_info);
70: struct mem_cgroup *old_memcg;
71:
72: if (name) {
73: len = name->len;
74: alloc_len += len + 1;
75: }
76:
77: pr_debug("%s: group=%p mark=%p mask=%x\n", __func__, group, inode_mark,
78: mask);
79:
80: i_mark = container_of(inode_mark, struct inotify_inode_mark,
81: fsn_mark);
82:
83: /*
84: * We can be racing with mark being detached. Don't report event with
85: * invalid wd.
86: */
87: wd = READ_ONCE(i_mark->wd);
88: if (wd == -1)
89: return 0;
90: /*
91: * Whoever is interested in the event, pays for the allocation. Do not
92: * trigger OOM killer in the target monitoring memcg as it may have
93: * security repercussion.
94: */
95: old_memcg = set_active_memcg(group->memcg);
96: event = kmalloc(alloc_len, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
97: set_active_memcg(old_memcg);
98:
99: if (unlikely(!event)) {
100: /*
101: * Treat lost event due to ENOMEM the same way as queue
102: * overflow to let userspace know event was lost.
103: */
104: fsnotify_queue_overflow(group);
105: return -ENOMEM;
106: }
107:
108: /*
109: * We now report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events
110: * for fanotify. inotify never reported IN_ISDIR with those events.
111: * It looks like an oversight, but to avoid the risk of breaking
112: * existing inotify programs, mask the flag out from those events.
113: */
114: if (mask & (IN_MOVE_SELF | IN_DELETE_SELF))
115: mask &= ~IN_ISDIR;
116:
117: fsn_event = &event->fse;
118: fsnotify_init_event(fsn_event);
119: event->mask = mask;
120: event->wd = wd;
121: event->sync_cookie = cookie;
122: event->name_len = len;
123: if (len)
124: strscpy(event->name, name->name, event->name_len + 1);
125:
126: ret = fsnotify_add_event(group, fsn_event, inotify_merge);
127: if (ret) {
128: /* Our event wasn't used in the end. Free it. */
129: fsnotify_destroy_event(group, fsn_event);
130: }
131:
132: if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
133: fsnotify_destroy_mark(inode_mark, group);
134:
135: return 0;
136: }
SourceFile:fs/notify/inotify/inotify_fsnotify.c] |
| 8/2 | 2026/03/26 00:25 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 6277 output: 43 thoughts: 232 Thoughts: **Analyzing a Race Condition in fsnotify Code** Okay, so I'm looking at a potential race condition related to `inode_mark->flags`. The report points to a read operation happening in `inotify_handle_inode_event`, specifically line 132, and a corresponding write in `fsnotify_free_mark` at line 584. The read, at `inotify_handle_inode_event+0x34c/0x3d0`, is checking for `FSNOTIFY_MARK_FLAG_IN_ONESHOT`, while the write is clearing `FSNOTIFY_MARK_FLAG_ALIVE` in `fsnotify_free_mark`. Now, the issue immediately becomes clearer. `FSNOTIFY_MARK_FLAG_ALIVE` is bit 0, which has a value of 1. `FSNOTIFY_MARK_FLAG_IN_ONESHOT` is another bit. The fact that the report points to those two lines occurring around the same time is enough to raise alarm bells, but let's be thorough. The critical thing here is to understand the intended interaction between these two flags. Are they independent, or is there a dependency that might introduce a window for a race? I'm going to need to look up how the flags are defined and used. I need to understand the semantics of both `FSNOTIFY_MARK_FLAG_IN_ONESHOT` and `FSNOTIFY_MARK_FLAG_ALIVE` and their relationship to the overall fsnotify lifecycle. I need to determine if a scenario can exist where the read on line 132 happens *after* the write on line 584, and what the implications are. This needs careful examination. |
| 9/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:include/linux/fsnotify_backend.h IncludeLines:true Name:fsnotify_mark] Results: map[SourceCode: 593: /*
594: * A mark is simply an object attached to an in core inode which allows an
595: * fsnotify listener to indicate they are either no longer interested in events
596: * of a type matching mask or only interested in those events.
597: *
598: * These are flushed when an inode is evicted from core and may be flushed
599: * when the inode is modified (as seen by fsnotify_access). Some fsnotify
600: * users (such as dnotify) will flush these when the open fd is closed and not
601: * at inode eviction or modification.
602: *
603: * Text in brackets is showing the lock(s) protecting modifications of a
604: * particular entry. obj_lock means either inode->i_lock or
605: * mnt->mnt_root->d_lock depending on the mark type.
606: */
607: struct fsnotify_mark {
608: /* Mask this mark is for [mark->lock, group->mark_mutex] */
609: __u32 mask;
610: /* We hold one for presence in g_list. Also one ref for each 'thing'
611: * in kernel that found and may be using this mark. */
612: refcount_t refcnt;
613: /* Group this mark is for. Set on mark creation, stable until last ref
614: * is dropped */
615: struct fsnotify_group *group;
616: /* List of marks by group->marks_list. Also reused for queueing
617: * mark into destroy_list when it's waiting for the end of SRCU period
618: * before it can be freed. [group->mark_mutex] */
619: struct list_head g_list;
620: /* Protects inode / mnt pointers, flags, masks */
621: spinlock_t lock;
622: /* List of marks for inode / vfsmount [connector->lock, mark ref] */
623: struct hlist_node obj_list;
624: /* Head of list of marks for an object [mark ref] */
625: struct fsnotify_mark_connector *connector;
626: /* Events types and flags to ignore [mark->lock, group->mark_mutex] */
627: __u32 ignore_mask;
628: /* General fsnotify mark flags */
629: #define FSNOTIFY_MARK_FLAG_ALIVE 0x0001
630: #define FSNOTIFY_MARK_FLAG_ATTACHED 0x0002
631: /* inotify mark flags */
632: #define FSNOTIFY_MARK_FLAG_EXCL_UNLINK 0x0010
633: #define FSNOTIFY_MARK_FLAG_IN_ONESHOT 0x0020
634: /* fanotify mark flags */
635: #define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY 0x0100
636: #define FSNOTIFY_MARK_FLAG_NO_IREF 0x0200
637: #define FSNOTIFY_MARK_FLAG_HAS_IGNORE_FLAGS 0x0400
638: #define FSNOTIFY_MARK_FLAG_HAS_FSID 0x0800
639: #define FSNOTIFY_MARK_FLAG_WEAK_FSID 0x1000
640: unsigned int flags; /* flags [mark->lock] */
641: };
SourceFile:include/linux/fsnotify_backend.h] |
| 10/2 | 2026/03/26 00:25 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 7931 output: 42 thoughts: 492 Thoughts: **Analyzing a Potential Race Condition in FSNOTIFY**

Okay, so I'm looking at this FSNOTIFY code and seeing a potential race condition that needs to be addressed. The core of it seems to involve the `FSNOTIFY_MARK_FLAG_ALIVE` and `FSNOTIFY_MARK_FLAG_IN_ONESHOT` flags, which are defined as `0x0001` and `0x0020` respectively. The report clearly shows that `FSNOTIFY_MARK_FLAG_ALIVE` was being set and then cleared: `0x00000001 -> 0x00000000`. And I can confirm from the code, the write action does a clear via `mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE;`. The read action checks the `IN_ONESHOT` flag `inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT`. Also the report tells me that `FSNOTIFY_MARK_FLAG_IN_ONESHOT` was *not* set during this change, otherwise the value would be 0x0021 -> 0x0020. This indicates that the flag wasn't set.

Now, here's where it gets interesting. If `FSNOTIFY_MARK_FLAG_IN_ONESHOT` isn't set, then `inotify_handle_inode_event` *shouldn't* call `fsnotify_destroy_mark`. However, I'm concerned because the race might be on the same `flags` field. The `fsnotify_free_mark` function is where things are being cleaned up and the `ALIVE` flag is set to zero. I can see there's a lock used here: `spin_lock(&mark->lock)` around the bitwise clear, `mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE;`. But the key point is in `inotify_handle_inode_event`. It looks like it *conditionally* calls `fsnotify_destroy_mark` based on checking `FSNOTIFY_MARK_FLAG_IN_ONESHOT`. The relevant section of the code is:

```c
132:	if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
133:		fsnotify_destroy_mark(inode_mark, group);
```

The really important detail here is that *this conditional check on line 132 is NOT protected by `mark->lock`*. If `FSNOTIFY_MARK_FLAG_IN_ONESHOT` *is* set, then `inotify_handle_inode_event` *will* call `fsnotify_destroy_mark`. I need to review this path further to confirm. And, of course, `fsnotify_destroy_mark` calls `fsnotify_free_mark`, which is the cleanup I mentioned before. Now, I have to examine `fsnotify_destroy_mark`. |
| 11/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/notify/mark.c IncludeLines:true Name:fsnotify_destroy_mark] Results: map[SourceCode: 596: void fsnotify_destroy_mark(struct fsnotify_mark *mark,
597: struct fsnotify_group *group)
598: {
599: fsnotify_group_lock(group);
600: fsnotify_detach_mark(mark);
601: fsnotify_group_unlock(group);
602: fsnotify_free_mark(mark);
603: }
SourceFile:fs/notify/mark.c] |
| 12/2 | 2026/03/26 00:25 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 9279 output: 42 thoughts: 726 Thoughts: **Analyzing a Potential Data Race in the inotify Implementation**
Okay, let's break this down. My initial thought is that `fsnotify_destroy_mark` calls `fsnotify_free_mark`, and `fsnotify_free_mark` clears the `FSNOTIFY_MARK_FLAG_ALIVE` flag. Simple enough.
Now, if `inotify_handle_inode_event` gets called, it means there's been an inotify event. And, if the mark is a "oneshot" mark, it *should* be destroyed after handling that event. The core of this seems to be a race condition.
I see the race is between two things: `inotify_rm_watch` called from userspace, which in turn calls `fsnotify_destroy_mark`, *and* `inotify_handle_inode_event`, which *might* also call `fsnotify_destroy_mark` if it's dealing with a oneshot mark. Essentially, `inotify_rm_watch` and an event from `inotify_handle_inode_event` can happen concurrently.
Okay, let's drill down into `inotify_handle_inode_event`:
```c
132: if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
133: fsnotify_destroy_mark(inode_mark, group);
```
So, if the `FSNOTIFY_MARK_FLAG_IN_ONESHOT` flag is set, it calls `fsnotify_destroy_mark`. Crucially, if `fsnotify_destroy_mark` is already running on this mark, then `fsnotify_free_mark` will clear the `FSNOTIFY_MARK_FLAG_ALIVE` flag.
Wait a second. I remember `fsnotify_free_mark` has a crucial check:
```c
578: spin_lock(&mark->lock);
579: /* something else already called this function on this mark */
580: if (!(mark->flags & FSNOTIFY_MARK_FLAG_ALIVE)) {
581: spin_unlock(&mark->lock);
582: return;
583: }
584: mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE;
585: spin_unlock(&mark->lock);
```
That `mark->lock` is supposed to prevent multiple calls to `fsnotify_free_mark` from stepping on each other. That's good.
But, the line: `if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)` *in* `inotify_handle_inode_event` isn't protected by `mark->lock`. That's where the problem is.
Now, that's it! If `inotify_handle_inode_event` reads `flags` while `fsnotify_free_mark` is writing to it, we have a data race!
Is it harmful? Let's trace it out. If `inotify_handle_inode_event` reads the flags and sees `FSNOTIFY_MARK_FLAG_IN_ONESHOT` is set, it calls `fsnotify_destroy_mark`. That, in turn, calls `fsnotify_free_mark`. `fsnotify_free_mark` then takes `mark->lock`, sees `FSNOTIFY_MARK_FLAG_ALIVE` is already cleared (if the other thread already did it), and returns.
Wait, `fsnotify_destroy_mark` also calls `fsnotify_detach_mark`, so let me check that function now.
|
| 13/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:fs/notify/mark.c IncludeLines:true Name:fsnotify_detach_mark] Results: map[SourceCode: 535: /*
536: * Mark mark as detached, remove it from group list. Mark still stays in object
537: * list until its last reference is dropped. Note that we rely on mark being
538: * removed from group list before corresponding reference to it is dropped. In
539: * particular we rely on mark->connector being valid while we hold
540: * group->mark_mutex if we found the mark through g_list.
541: *
542: * Must be called with group->mark_mutex held. The caller must either hold
543: * reference to the mark or be protected by fsnotify_mark_srcu.
544: */
545: void fsnotify_detach_mark(struct fsnotify_mark *mark)
546: {
547: fsnotify_group_assert_locked(mark->group);
548: WARN_ON_ONCE(!srcu_read_lock_held(&fsnotify_mark_srcu) &&
549: refcount_read(&mark->refcnt) < 1 +
550: !!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED));
551:
552: spin_lock(&mark->lock);
553: /* something else already called this function on this mark */
554: if (!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED)) {
555: spin_unlock(&mark->lock);
556: return;
557: }
558: mark->flags &= ~FSNOTIFY_MARK_FLAG_ATTACHED;
559: list_del_init(&mark->g_list);
560: spin_unlock(&mark->lock);
561:
562: /* Drop mark reference acquired in fsnotify_add_mark_locked() */
563: fsnotify_put_mark(mark);
564: }
SourceFile:fs/notify/mark.c] |
| 14/2 | 2026/03/26 00:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 11419 output: 26 thoughts: 2186 Thoughts: **Analyzing a Potential Data Race in fsnotify** I'm looking at a potential data race between `fsnotify_free_mark` and `inotify_handle_inode_event`, both touching `mark->flags`: `fsnotify_free_mark` clears `FSNOTIFY_MARK_FLAG_ALIVE` under `mark->lock`, while `inotify_handle_inode_event` reads `FSNOTIFY_MARK_FLAG_IN_ONESHOT` with no synchronization at all. My initial thought: a concurrent write to one bit of an `unsigned int` shouldn't affect the read of another bit, assuming the architecture provides atomic word-level loads and stores, which Linux generally relies on. So even if `fsnotify_free_mark` is mid-write, `inotify_handle_inode_event` might observe an intermediate value, but the individual bit it tests stays consistent. The caveat is the compiler, which can transform the plain access in ways that break that atomicity; that is where KCSAN's warning becomes relevant. Tracing the call chain: `fsnotify_destroy_mark` (called from `inotify_rm_watch`) runs `fsnotify_detach_mark`, clearing `FSNOTIFY_MARK_FLAG_ATTACHED`, then `fsnotify_free_mark`, clearing `FSNOTIFY_MARK_FLAG_ALIVE`; meanwhile `inotify_handle_inode_event` is called for the event. Does the race cause problems whichever way the `IN_ONESHOT` check goes? If the bit is set and Thread B sees it, it calls `fsnotify_destroy_mark` itself, which is harmless against the concurrent teardown because the oneshot mark should be destroyed anyway.
If Thread B reads `0` instead, it won't call `fsnotify_destroy_mark`, but that is also OK: the mark is already being destroyed by Thread A. And since `inode_mark` is obtained via `fsnotify_foreach_iter_mark_type`, SRCU keeps the mark alive for the duration of the iteration. So the race *looks* benign: a partially written value (bit 0 being modified while bit 5 is read) shouldn't affect the check. Still, KCSAN is flagging it, and that's a signal; the potential for compiler reordering or a refetched load remains. Then it hit me: the code already uses `READ_ONCE` for `i_mark->wd`, but *not* for `inode_mark->flags`. That's a glaring inconsistency. Applying `READ_ONCE(inode_mark->flags)` to the flags read would rule out stale or torn values from compiler optimizations; taking `mark->lock` in this hot path would be undesirable, so the annotation seems the better fix. |
| 15/2 | 2026/03/26 00:25 | tool | codesearch-dir-index |
Args:map[Dir:fs/notify/fanotify/] Results: map[Files:[fanotify.c fanotify.h fanotify_user.c] Subdirs:<nil>] |
| 16/2 | 2026/03/26 00:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 14438 output: 49 thoughts: 29 Thoughts: **Diving into fanotify Event Handling** Time to get to the heart of the matter: how does `fanotify` handle events? The starting point is `fanotify_handle_event`, defined in `fs/notify/fanotify/fanotify.c`; I'm assuming it is the central dispatch for fanotify's event processing. I need to see how it receives events, what it does with them, how it decides what action to take, and how it interacts with user-space applications, paying close attention to mutexes and locks, concurrency, edge cases, and race conditions. I also need to understand the data structures involved, particularly how event metadata is structured and passed around. The goal is a solid grasp of the kernel-side event processing flow; this will be my starting point for a deeper dive. |
| 17/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
Error:requested entity does not exist Args: map[ContextFile:fs/notify/fanotify/fanotify.c IncludeLines:true Name:fanotify_handle_event] Results: map[SourceCode: SourceFile:] |
| 18/2 | 2026/03/26 00:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 14787 output: 28 thoughts: 27 Thoughts: **Investigating a Suspected Fanotify Handler** Based on the context, I'm leaning toward `fanotify_handle_inode_event` as the most likely candidate here. Time to confirm or refute that hypothesis: dive directly into the file contents and verify what's going on under the hood. |
| 19/2 | 2026/03/26 00:25 | tool | read-file |
Args:map[File:fs/notify/fanotify/fanotify.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0
#include <linux/fanotify.h>
#include <linux/fsnotify_backend.h>
#include <linux/init.h>
#include <linux/jiffies.h>
#include <linux/kernel.h> /* UINT_MAX */
#include <linux/mount.h>
#include <linux/sched.h>
#include <linux/sched/user.h>
#include <linux/sched/signal.h>
#include <linux/types.h>
#include <linux/wait.h>
#include <linux/audit.h>
#include <linux/sched/mm.h>
#include <linux/statfs.h>
#include <linux/stringhash.h>
#include "fanotify.h"
static bool fanotify_path_equal(const struct path *p1, const struct path *p2)
{
return p1->mnt == p2->mnt && p1->dentry == p2->dentry;
}
static unsigned int fanotify_hash_path(const struct path *path)
{
return hash_ptr(path->dentry, FANOTIFY_EVENT_HASH_BITS) ^
hash_ptr(path->mnt, FANOTIFY_EVENT_HASH_BITS);
}
static unsigned int fanotify_hash_fsid(__kernel_fsid_t *fsid)
{
return hash_32(fsid->val[0], FANOTIFY_EVENT_HASH_BITS) ^
hash_32(fsid->val[1], FANOTIFY_EVENT_HASH_BITS);
}
static bool fanotify_fh_equal(struct fanotify_fh *fh1,
struct fanotify_fh *fh2)
{
if (fh1->type != fh2->type || fh1->len != fh2->len)
return false;
return !fh1->len ||
!memcmp(fanotify_fh_buf(fh1), fanotify_fh_buf(fh2), fh1->len);
}
static unsigned int fanotify_hash_fh(struct fanotify_fh *fh)
{
long salt = (long)fh->type | (long)fh->len << 8;
/*
* full_name_hash() works long by long, so it handles fh buf optimally.
*/
return full_name_hash((void *)salt, fanotify_fh_buf(fh), fh->len);
}
static bool fanotify_fid_event_equal(struct fanotify_fid_event *ffe1,
struct fanotify_fid_event *ffe2)
{
/* Do not merge fid events without object fh */
if (!ffe1->object_fh.len)
return false;
return fanotify_fsid_equal(&ffe1->fsid, &ffe2->fsid) &&
fanotify_fh_equal(&ffe1->object_fh, &ffe2->object_fh);
}
static bool fanotify_info_equal(struct fanotify_info *info1,
struct fanotify_info *info2)
{
if (info1->dir_fh_totlen != info2->dir_fh_totlen ||
info1->dir2_fh_totlen != info2->dir2_fh_totlen ||
info1->file_fh_totlen != info2->file_fh_totlen ||
info1->name_len != info2->name_len ||
info1->name2_len != info2->name2_len)
return false;
if (info1->dir_fh_totlen &&
!fanotify_fh_equal(fanotify_info_dir_fh(info1),
fanotify_info_dir_fh(info2)))
return false;
if (info1->dir2_fh_totlen &&
!fanotify_fh_equal(fanotify_info_dir2_fh(info1),
fanotify_info_dir2_fh(info2)))
return false;
if (info1->file_fh_totlen &&
!fanotify_fh_equal(fanotify_info_file_fh(info1),
fanotify_info_file_fh(info2)))
return false;
if (info1->name_len &&
memcmp(fanotify_info_name(info1), fanotify_info_name(info2),
info1->name_len))
return false;
return !info1->name2_len ||
!memcmp(fanotify_info_name2(info1), fanotify_info_name2(info2),
info1->name2_len);
}
static bool fanotify_name_event_equal(struct fanotify_name_event *fne1,
struct fanotify_name_event *fne2)
{
struct fanotify_info *info1 = &fne1->info;
struct fanotify_info *info2 = &fne2->info;
/* Do not merge name events without dir fh */
if (!info1->dir_fh_totlen)
return false;
if (!fanotify_fsid_equal(&fne1->fsid, &fne2->fsid))
return false;
return fanotify_info_equal(info1, info2);
}
static bool fanotify_error_event_equal(struct fanotify_error_event *fee1,
struct fanotify_error_event *fee2)
{
/* Error events against the same file system are always merged. */
if (!fanotify_fsid_equal(&fee1->fsid, &fee2->fsid))
return false;
return true;
}
static bool fanotify_should_merge(struct fanotify_event *old,
struct fanotify_event *new)
{
pr_debug("%s: old=%p new=%p\n", __func__, old, new);
if (old->hash != new->hash ||
old->type != new->type || old->pid != new->pid)
return false;
/*
* We want to merge many dirent events in the same dir (i.e.
* creates/unlinks/renames), but we do not want to merge dirent
* events referring to subdirs with dirent events referring to
* non subdirs, otherwise, user won't be able to tell from a
* mask FAN_CREATE|FAN_DELETE|FAN_ONDIR if it describes mkdir+
* unlink pair or rmdir+create pair of events.
*/
if ((old->mask & FS_ISDIR) != (new->mask & FS_ISDIR))
return false;
/*
* FAN_RENAME event is reported with special info record types,
* so we cannot merge it with other events.
*/
if ((old->mask & FAN_RENAME) != (new->mask & FAN_RENAME))
return false;
switch (old->type) {
case FANOTIFY_EVENT_TYPE_PATH:
return fanotify_path_equal(fanotify_event_path(old),
fanotify_event_path(new));
case FANOTIFY_EVENT_TYPE_FID:
return fanotify_fid_event_equal(FANOTIFY_FE(old),
FANOTIFY_FE(new));
case FANOTIFY_EVENT_TYPE_FID_NAME:
return fanotify_name_event_equal(FANOTIFY_NE(old),
FANOTIFY_NE(new));
case FANOTIFY_EVENT_TYPE_FS_ERROR:
return fanotify_error_event_equal(FANOTIFY_EE(old),
FANOTIFY_EE(new));
case FANOTIFY_EVENT_TYPE_MNT:
return false;
default:
WARN_ON_ONCE(1);
}
return false;
}
/* Limit event merges to limit CPU overhead per event */
#define FANOTIFY_MAX_MERGE_EVENTS 128
/* and the list better be locked by something too! */
static int fanotify_merge(struct fsnotify_group *group,
struct fsnotify_event *event)
{
struct fanotify_event *old, *new = FANOTIFY_E(event);
unsigned int bucket = fanotify_event_hash_bucket(group, new);
struct hlist_head *hlist = &group->fanotify_data.merge_hash[bucket];
int i = 0;
pr_debug("%s: group=%p event=%p bucket=%u\n", __func__,
group, event, bucket);
/*
* Don't merge a permission event with any other event so that we know
* the event structure we have created in fanotify_handle_event() is the
* one we should check for permission response.
*/
if (fanotify_is_perm_event(new->mask))
return 0;
hlist_for_each_entry(old, hlist, merge_list) {
if (++i > FANOTIFY_MAX_MERGE_EVENTS)
break;
if (fanotify_should_merge(old, new)) {
old->mask |= new->mask;
if (fanotify_is_error_event(old->mask))
FANOTIFY_EE(old)->err_count++;
return 1;
}
}
return 0;
}
/*
* Wait for response to permission event. The function also takes care of
* freeing the permission event (or offloads that in case the wait is canceled
* by a signal). The function returns 0 in case access got allowed by userspace,
* -EPERM in case userspace disallowed the access, and -ERESTARTSYS in case
* the wait got interrupted by a signal.
*/
static int fanotify_get_response(struct fsnotify_group *group,
struct fanotify_perm_event *event,
struct fsnotify_iter_info *iter_info)
{
int ret, errno;
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
ret = wait_event_state(group->fanotify_data.access_waitq,
event->state == FAN_EVENT_ANSWERED,
(TASK_KILLABLE|TASK_FREEZABLE));
/* Signal pending? */
if (ret < 0) {
spin_lock(&group->notification_lock);
/* Event reported to userspace and no answer yet? */
if (event->state == FAN_EVENT_REPORTED) {
/* Event will get freed once userspace answers to it */
event->state = FAN_EVENT_CANCELED;
spin_unlock(&group->notification_lock);
return ret;
}
/* Event not yet reported? Just remove it. */
if (event->state == FAN_EVENT_INIT) {
fsnotify_remove_queued_event(group, &event->fae.fse);
/* Permission events are not supposed to be hashed */
WARN_ON_ONCE(!hlist_unhashed(&event->fae.merge_list));
}
/*
* Event may be also answered in case signal delivery raced
* with wakeup. In that case we have nothing to do besides
* freeing the event and reporting error.
*/
spin_unlock(&group->notification_lock);
goto out;
}
/* userspace responded, convert to something usable */
switch (event->response & FANOTIFY_RESPONSE_ACCESS) {
case FAN_ALLOW:
ret = 0;
break;
case FAN_DENY:
/* Check custom errno from pre-content events */
errno = fanotify_get_response_errno(event->response);
if (errno) {
ret = -errno;
break;
}
fallthrough;
default:
ret = -EPERM;
}
/* Check if the response should be audited */
if (event->response & FAN_AUDIT) {
u32 response = event->response &
(FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
audit_fanotify(response & ~FAN_AUDIT, &event->audit_rule);
}
pr_debug("%s: group=%p event=%p about to return ret=%d\n", __func__,
group, event, ret);
out:
fsnotify_destroy_event(group, &event->fae.fse);
return ret;
}
/*
* This function returns a mask for an event that only contains the flags
* that have been specifically requested by the user. Flags that may have
* been included within the event mask, but have not been explicitly
* requested by the user, will not be present in the returned mask.
*/
static u32 fanotify_group_event_mask(struct fsnotify_group *group,
struct fsnotify_iter_info *iter_info,
u32 *match_mask, u32 event_mask,
const void *data, int data_type,
struct inode *dir)
{
__u32 marks_mask = 0, marks_ignore_mask = 0;
__u32 test_mask, user_mask = FANOTIFY_OUTGOING_EVENTS |
FANOTIFY_EVENT_FLAGS;
const struct path *path = fsnotify_data_path(data, data_type);
unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
struct fsnotify_mark *mark;
bool ondir = event_mask & FAN_ONDIR;
int type;
pr_debug("%s: report_mask=%x mask=%x data=%p data_type=%d\n",
__func__, iter_info->report_mask, event_mask, data, data_type);
if (FAN_GROUP_FLAG(group, FAN_REPORT_MNT)) {
if (data_type != FSNOTIFY_EVENT_MNT)
return 0;
} else if (!fid_mode) {
/* Do we have path to open a file descriptor? */
if (!path)
return 0;
/* Path type events are only relevant for files and dirs */
if (!d_is_reg(path->dentry) && !d_can_lookup(path->dentry))
return 0;
} else if (!(fid_mode & FAN_REPORT_FID)) {
/* Do we have a directory inode to report? */
if (!dir && !ondir)
return 0;
}
fsnotify_foreach_iter_mark_type(iter_info, mark, type) {
/*
* Apply ignore mask depending on event flags in ignore mask.
*/
marks_ignore_mask |=
fsnotify_effective_ignore_mask(mark, ondir, type);
/*
* Send the event depending on event flags in mark mask.
*/
if (!fsnotify_mask_applicable(mark->mask, ondir, type))
continue;
marks_mask |= mark->mask;
/* Record the mark types of this group that matched the event */
*match_mask |= 1U << type;
}
test_mask = event_mask & marks_mask & ~marks_ignore_mask;
/*
* For dirent modification events (create/delete/move) that do not carry
* the child entry name information, we report FAN_ONDIR for mkdir/rmdir
* so user can differentiate them from creat/unlink.
*
* For backward compatibility and consistency, do not report FAN_ONDIR
* to user in legacy fanotify mode (reporting fd) and report FAN_ONDIR
* to user in fid mode for all event types.
*
* We never report FAN_EVENT_ON_CHILD to user, but we do pass it in to
* fanotify_alloc_event() when group is reporting fid as indication
* that event happened on child.
*/
if (fid_mode) {
/* Do not report event flags without any event */
if (!(test_mask & ~FANOTIFY_EVENT_FLAGS))
return 0;
} else {
user_mask &= ~FANOTIFY_EVENT_FLAGS;
}
return test_mask & user_mask;
}
/*
* Check size needed to encode fanotify_fh.
*
* Return size of encoded fh without fanotify_fh header.
* Return 0 on failure to encode.
*/
static int fanotify_encode_fh_len(struct inode *inode)
{
int dwords = 0;
int fh_len;
if (!inode)
return 0;
exportfs_encode_fid(inode, NULL, &dwords);
fh_len = dwords << 2;
/*
* struct fanotify_error_event might be preallocated and is
* limited to MAX_HANDLE_SZ. This should never happen, but
* safeguard by forcing an invalid file handle.
*/
if (WARN_ON_ONCE(fh_len > MAX_HANDLE_SZ))
return 0;
return fh_len;
}
/*
* Encode fanotify_fh.
*
* Return total size of encoded fh including fanotify_fh header.
* Return 0 on failure to encode.
*/
static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
unsigned int fh_len, unsigned int *hash,
gfp_t gfp)
{
int dwords, type = 0;
char *ext_buf = NULL;
void *buf = fh + 1;
int err;
fh->type = FILEID_ROOT;
fh->len = 0;
fh->flags = 0;
/*
* Invalid FHs are used by FAN_FS_ERROR for errors not
* linked to any inode. The f_handle won't be reported
* back to userspace.
*/
if (!inode)
goto out;
/*
* !gpf means preallocated variable size fh, but fh_len could
* be zero in that case if encoding fh len failed.
*/
err = -ENOENT;
if (fh_len < 4 || WARN_ON_ONCE(fh_len % 4) || fh_len > MAX_HANDLE_SZ)
goto out_err;
/* No external buffer in a variable size allocated fh */
if (gfp && fh_len > FANOTIFY_INLINE_FH_LEN) {
/* Treat failure to allocate fh as failure to encode fh */
err = -ENOMEM;
ext_buf = kmalloc(fh_len, gfp);
if (!ext_buf)
goto out_err;
*fanotify_fh_ext_buf_ptr(fh) = ext_buf;
buf = ext_buf;
fh->flags |= FANOTIFY_FH_FLAG_EXT_BUF;
}
dwords = fh_len >> 2;
type = exportfs_encode_fid(inode, buf, &dwords);
err = -EINVAL;
/*
* Unlike file_handle, type and len of struct fanotify_fh are u8.
* Traditionally, filesystem return handle_type < 0xff, but there
* is no enforecement for that in vfs.
*/
BUILD_BUG_ON(MAX_HANDLE_SZ > 0xff || FILEID_INVALID > 0xff);
if (type <= 0 || type >= FILEID_INVALID || fh_len != dwords << 2)
goto out_err;
fh->type = type;
fh->len = fh_len;
out:
/*
* Mix fh into event merge key. Hash might be NULL in case of
* unhashed FID events (i.e. FAN_FS_ERROR).
*/
if (hash)
*hash ^= fanotify_hash_fh(fh);
return FANOTIFY_FH_HDR_LEN + fh_len;
out_err:
pr_warn_ratelimited("fanotify: failed to encode fid (type=%d, len=%d, err=%i)\n",
type, fh_len, err);
kfree(ext_buf);
*fanotify_fh_ext_buf_ptr(fh) = NULL;
/* Report the event without a file identifier on encode error */
fh->type = FILEID_INVALID;
fh->len = 0;
return 0;
}
/*
* FAN_REPORT_FID is ambiguous in that it reports the fid of the child for
* some events and the fid of the parent for create/delete/move events.
*
* With the FAN_REPORT_TARGET_FID flag, the fid of the child is reported
* also in create/delete/move events in addition to the fid of the parent
* and the name of the child.
*/
static inline bool fanotify_report_child_fid(unsigned int fid_mode, u32 mask)
{
if (mask & ALL_FSNOTIFY_DIRENT_EVENTS)
return (fid_mode & FAN_REPORT_TARGET_FID);
return (fid_mode & FAN_REPORT_FID) && !(mask & FAN_ONDIR);
}
/*
* The inode to use as identifier when reporting fid depends on the event
* and the group flags.
*
* With the group flag FAN_REPORT_TARGET_FID, always report the child fid.
*
* Without the group flag FAN_REPORT_TARGET_FID, report the modified directory
* fid on dirent events and the child fid otherwise.
*
* For example:
* FS_ATTRIB reports the child fid even if reported on a watched parent.
* FS_CREATE reports the modified dir fid without FAN_REPORT_TARGET_FID.
* and reports the created child fid with FAN_REPORT_TARGET_FID.
*/
static struct inode *fanotify_fid_inode(u32 event_mask, const void *data,
int data_type, struct inode *dir,
unsigned int fid_mode)
{
if ((event_mask & ALL_FSNOTIFY_DIRENT_EVENTS) &&
!(fid_mode & FAN_REPORT_TARGET_FID))
return dir;
return fsnotify_data_inode(data, data_type);
}
/*
* The inode to use as identifier when reporting dir fid depends on the event.
* Report the modified directory inode on dirent modification events.
* Report the "victim" inode if "victim" is a directory.
* Report the parent inode if "victim" is not a directory and event is
* reported to parent.
* Otherwise, do not report dir fid.
*/
static struct inode *fanotify_dfid_inode(u32 event_mask, const void *data,
int data_type, struct inode *dir)
{
struct inode *inode = fsnotify_data_inode(data, data_type);
if (event_mask & ALL_FSNOTIFY_DIRENT_EVENTS)
return dir;
if (inode && S_ISDIR(inode->i_mode))
return inode;
return dir;
}
static struct fanotify_event *fanotify_alloc_path_event(const struct path *path,
unsigned int *hash,
gfp_t gfp)
{
struct fanotify_path_event *pevent;
pevent = kmem_cache_alloc(fanotify_path_event_cachep, gfp);
if (!pevent)
return NULL;
pevent->fae.type = FANOTIFY_EVENT_TYPE_PATH;
pevent->path = *path;
*hash ^= fanotify_hash_path(path);
path_get(path);
return &pevent->fae;
}
static struct fanotify_event *fanotify_alloc_mnt_event(u64 mnt_id, gfp_t gfp)
{
struct fanotify_mnt_event *pevent;
pevent = kmem_cache_alloc(fanotify_mnt_event_cachep, gfp);
if (!pevent)
return NULL;
pevent->fae.type = FANOTIFY_EVENT_TYPE_MNT;
pevent->mnt_id = mnt_id;
return &pevent->fae;
}
static struct fanotify_event *fanotify_alloc_perm_event(const void *data,
int data_type,
gfp_t gfp)
{
const struct path *path = fsnotify_data_path(data, data_type);
const struct file_range *range =
fsnotify_data_file_range(data, data_type);
struct fanotify_perm_event *pevent;
pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp);
if (!pevent)
return NULL;
pevent->fae.type = FANOTIFY_EVENT_TYPE_PATH_PERM;
pevent->response = 0;
pevent->hdr.type = FAN_RESPONSE_INFO_NONE;
pevent->hdr.pad = 0;
pevent->hdr.len = 0;
pevent->state = FAN_EVENT_INIT;
pevent->path = *path;
/* NULL ppos means no range info */
pevent->ppos = range ? &range->pos : NULL;
pevent->count = range ? range->count : 0;
path_get(path);
return &pevent->fae;
}
static struct fanotify_event *fanotify_alloc_fid_event(struct inode *id,
__kernel_fsid_t *fsid,
unsigned int *hash,
gfp_t gfp)
{
struct fanotify_fid_event *ffe;
ffe = kmem_cache_alloc(fanotify_fid_event_cachep, gfp);
if (!ffe)
return NULL;
ffe->fae.type = FANOTIFY_EVENT_TYPE_FID;
ffe->fsid = *fsid;
*hash ^= fanotify_hash_fsid(fsid);
fanotify_encode_fh(&ffe->object_fh, id, fanotify_encode_fh_len(id),
hash, gfp);
return &ffe->fae;
}
static struct fanotify_event *fanotify_alloc_name_event(struct inode *dir,
__kernel_fsid_t *fsid,
const struct qstr *name,
struct inode *child,
struct dentry *moved,
unsigned int *hash,
gfp_t gfp)
{
struct fanotify_name_event *fne;
struct fanotify_info *info;
struct fanotify_fh *dfh, *ffh;
struct inode *dir2 = moved ? d_inode(moved->d_parent) : NULL;
const struct qstr *name2 = moved ? &moved->d_name : NULL;
unsigned int dir_fh_len = fanotify_encode_fh_len(dir);
unsigned int dir2_fh_len = fanotify_encode_fh_len(dir2);
unsigned int child_fh_len = fanotify_encode_fh_len(child);
unsigned long name_len = name ? name->len : 0;
unsigned long name2_len = name2 ? name2->len : 0;
unsigned int len, size;
/* Reserve terminating null byte even for empty name */
size = sizeof(*fne) + name_len + name2_len + 2;
if (dir_fh_len)
size += FANOTIFY_FH_HDR_LEN + dir_fh_len;
if (dir2_fh_len)
size += FANOTIFY_FH_HDR_LEN + dir2_fh_len;
if (child_fh_len)
size += FANOTIFY_FH_HDR_LEN + child_fh_len;
fne = kmalloc(size, gfp);
if (!fne)
return NULL;
fne->fae.type = FANOTIFY_EVENT_TYPE_FID_NAME;
fne->fsid = *fsid;
*hash ^= fanotify_hash_fsid(fsid);
info = &fne->info;
fanotify_info_init(info);
if (dir_fh_len) {
dfh = fanotify_info_dir_fh(info);
len = fanotify_encode_fh(dfh, dir, dir_fh_len, hash, 0);
fanotify_info_set_dir_fh(info, len);
}
if (dir2_fh_len) {
dfh = fanotify_info_dir2_fh(info);
len = fanotify_encode_fh(dfh, dir2, dir2_fh_len, hash, 0);
fanotify_info_set_dir2_fh(info, len);
}
if (child_fh_len) {
ffh = fanotify_info_file_fh(info);
len = fanotify_encode_fh(ffh, child, child_fh_len, hash, 0);
fanotify_info_set_file_fh(info, len);
}
if (name_len) {
fanotify_info_copy_name(info, name);
*hash ^= full_name_hash((void *)name_len, name->name, name_len);
}
if (name2_len) {
fanotify_info_copy_name2(info, name2);
*hash ^= full_name_hash((void *)name2_len, name2->name,
name2_len);
}
pr_debug("%s: size=%u dir_fh_len=%u child_fh_len=%u name_len=%u name='%.*s'\n",
__func__, size, dir_fh_len, child_fh_len,
info->name_len, info->name_len, fanotify_info_name(info));
if (dir2_fh_len) {
pr_debug("%s: dir2_fh_len=%u name2_len=%u name2='%.*s'\n",
__func__, dir2_fh_len, info->name2_len,
info->name2_len, fanotify_info_name2(info));
}
return &fne->fae;
}
static struct fanotify_event *fanotify_alloc_error_event(
struct fsnotify_group *group,
__kernel_fsid_t *fsid,
const void *data, int data_type,
unsigned int *hash)
{
struct fs_error_report *report =
fsnotify_data_error_report(data, data_type);
struct inode *inode;
struct fanotify_error_event *fee;
int fh_len;
if (WARN_ON_ONCE(!report))
return NULL;
fee = mempool_alloc(&group->fanotify_data.error_events_pool, GFP_NOFS);
if (!fee)
return NULL;
fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
fee->error = report->error;
fee->err_count = 1;
fee->fsid = *fsid;
inode = report->inode;
fh_len = fanotify_encode_fh_len(inode);
/* Bad fh_len. Fallback to using an invalid fh. Should never happen. */
if (!fh_len && inode)
inode = NULL;
fanotify_encode_fh(&fee->object_fh, inode, fh_len, NULL, 0);
*hash ^= fanotify_hash_fsid(fsid);
return &fee->fae;
}
static struct fanotify_event *fanotify_alloc_event(
struct fsnotify_group *group,
u32 mask, const void *data, int data_type,
struct inode *dir, const struct qstr *file_name,
__kernel_fsid_t *fsid, u32 match_mask)
{
struct fanotify_event *event = NULL;
gfp_t gfp = GFP_KERNEL_ACCOUNT;
unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
struct inode *id = fanotify_fid_inode(mask, data, data_type, dir,
fid_mode);
struct inode *dirid = fanotify_dfid_inode(mask, data, data_type, dir);
const struct path *path = fsnotify_data_path(data, data_type);
u64 mnt_id = fsnotify_data_mnt_id(data, data_type);
struct mem_cgroup *old_memcg;
struct dentry *moved = NULL;
struct inode *child = NULL;
bool name_event = false;
unsigned int hash = 0;
bool ondir = mask & FAN_ONDIR;
struct pid *pid;
if ((fid_mode & FAN_REPORT_DIR_FID) && dirid) {
/*
* For certain events and group flags, report the child fid
* in addition to reporting the parent fid and maybe child name.
*/
if (fanotify_report_child_fid(fid_mode, mask) && id != dirid)
child = id;
id = dirid;
/*
* We record file name only in a group with FAN_REPORT_NAME
* and when we have a directory inode to report.
*
* For directory entry modification event, we record the fid of
* the directory and the name of the modified entry.
*
* For event on non-directory that is reported to parent, we
* record the fid of the parent and the name of the child.
*
* Even if not reporting name, we need a variable length
* fanotify_name_event if reporting both parent and child fids.
*/
if (!(fid_mode & FAN_REPORT_NAME)) {
name_event = !!child;
file_name = NULL;
} else if ((mask & ALL_FSNOTIFY_DIRENT_EVENTS) || !ondir) {
name_event = true;
}
/*
* In the special case of FAN_RENAME event, use the match_mask
* to determine if we need to report only the old parent+name,
* only the new parent+name or both.
* 'dirid' and 'file_name' are the old parent+name and
* 'moved' has the new parent+name.
*/
if (mask & FAN_RENAME) {
bool report_old, report_new;
if (WARN_ON_ONCE(!match_mask))
return NULL;
/* Report both old and new parent+name if sb watching */
report_old = report_new =
match_mask & (1U << FSNOTIFY_ITER_TYPE_SB);
report_old |=
match_mask & (1U << FSNOTIFY_ITER_TYPE_INODE);
report_new |=
match_mask & (1U << FSNOTIFY_ITER_TYPE_INODE2);
if (!report_old) {
/* Do not report old parent+name */
dirid = NULL;
file_name = NULL;
}
if (report_new) {
/* Report new parent+name */
moved = fsnotify_data_dentry(data, data_type);
}
}
}
/*
* For queues with unlimited length lost events are not expected and
* can possibly have security implications. Avoid losing events when
* memory is short. For the limited size queues, avoid OOM killer in the
* target monitoring memcg as it may have security repercussion.
*/
if (group->max_events == UINT_MAX)
gfp |= __GFP_NOFAIL;
else
gfp |= __GFP_RETRY_MAYFAIL;
/* Whoever is interested in the event, pays for the allocation. */
old_memcg = set_active_memcg(group->memcg);
if (fanotify_is_perm_event(mask)) {
event = fanotify_alloc_perm_event(data, data_type, gfp);
} else if (fanotify_is_error_event(mask)) {
event = fanotify_alloc_error_event(group, fsid, data,
data_type, &hash);
} else if (name_event && (file_name || moved || child)) {
event = fanotify_alloc_name_event(dirid, fsid, file_name, child,
moved, &hash, gfp);
} else if (fid_mode) {
event = fanotify_alloc_fid_event(id, fsid, &hash, gfp);
} else if (path) {
event = fanotify_alloc_path_event(path, &hash, gfp);
} else if (mnt_id) {
event = fanotify_alloc_mnt_event(mnt_id, gfp);
} else {
WARN_ON_ONCE(1);
}
if (!event)
goto out;
if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
pid = get_pid(task_pid(current));
else
pid = get_pid(task_tgid(current));
/* Mix event info, FAN_ONDIR flag and pid into event merge key */
hash ^= hash_long((unsigned long)pid | ondir, FANOTIFY_EVENT_HASH_BITS);
fanotify_init_event(event, hash, mask);
event->pid = pid;
out:
set_active_memcg(old_memcg);
return event;
}
/*
* Get cached fsid of the filesystem containing the object from any mark.
* All marks are supposed to have the same fsid, but we do not verify that here.
*/
static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info)
{
struct fsnotify_mark *mark;
int type;
__kernel_fsid_t fsid = {};
fsnotify_foreach_iter_mark_type(iter_info, mark, type) {
if (!(mark->flags & FSNOTIFY_MARK_FLAG_HAS_FSID))
continue;
fsid = FANOTIFY_MARK(mark)->fsid;
if (!(mark->flags & FSNOTIFY_MARK_FLAG_WEAK_FSID) &&
WARN_ON_ONCE(!fsid.val[0] && !fsid.val[1]))
continue;
return fsid;
}
return fsid;
}
/*
* Add an event to hash table for faster merge.
*/
static void fanotify_insert_event(struct fsnotify_group *group,
struct fsnotify_event *fsn_event)
{
struct fanotify_event *event = FANOTIFY_E(fsn_event);
unsigned int bucket = fanotify_event_hash_bucket(group, event);
struct hlist_head *hlist = &group->fanotify_data.merge_hash[bucket];
assert_spin_locked(&group->notification_lock);
if (!fanotify_is_hashed_event(event->mask))
return;
pr_debug("%s: group=%p event=%p bucket=%u\n", __func__,
group, event, bucket);
hlist_add_head(&event->merge_list, hlist);
}
static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
				 const void *data, int data_type,
				 struct inode *dir,
				 const struct qstr *file_name, u32 cookie,
				 struct fsnotify_iter_info *iter_info)
{
	int ret = 0;
	struct fanotify_event *event;
	struct fsnotify_event *fsn_event;
	__kernel_fsid_t fsid = {};
	u32 match_mask = 0;

	BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
	BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
	BUILD_BUG_ON(FAN_ATTRIB != FS_ATTRIB);
	BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
	BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
	BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
	BUILD_BUG_ON(FAN_MOVED_TO != FS_MOVED_TO);
	BUILD_BUG_ON(FAN_MOVED_FROM != FS_MOVED_FROM);
	BUILD_BUG_ON(FAN_CREATE != FS_CREATE);
	BUILD_BUG_ON(FAN_DELETE != FS_DELETE);
	BUILD_BUG_ON(FAN_DELETE_SELF != FS_DELETE_SELF);
	BUILD_BUG_ON(FAN_MOVE_SELF != FS_MOVE_SELF);
	BUILD_BUG_ON(FAN_EVENT_ON_CHILD != FS_EVENT_ON_CHILD);
	BUILD_BUG_ON(FAN_Q_OVERFLOW != FS_Q_OVERFLOW);
	BUILD_BUG_ON(FAN_OPEN_PERM != FS_OPEN_PERM);
	BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
	BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
	BUILD_BUG_ON(FAN_OPEN_EXEC != FS_OPEN_EXEC);
	BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
	BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
	BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
	BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);

	BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 24);

	mask = fanotify_group_event_mask(group, iter_info, &match_mask,
					 mask, data, data_type, dir);
	if (!mask)
		return 0;

	pr_debug("%s: group=%p mask=%x report_mask=%x\n", __func__,
		 group, mask, match_mask);

	if (fanotify_is_perm_event(mask)) {
		/*
		 * fsnotify_prepare_user_wait() fails if we race with mark
		 * deletion. Just let the operation pass in that case.
		 */
		if (!fsnotify_prepare_user_wait(iter_info))
			return 0;
	}

	if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS))
		fsid = fanotify_get_fsid(iter_info);

	event = fanotify_alloc_event(group, mask, data, data_type, dir,
				     file_name, &fsid, match_mask);
	ret = -ENOMEM;
	if (unlikely(!event)) {
		/*
		 * We don't queue overflow events for permission events as
		 * there the access is denied and so no event is in fact lost.
		 */
		if (!fanotify_is_perm_event(mask))
			fsnotify_queue_overflow(group);
		goto finish;
	}

	fsn_event = &event->fse;
	ret = fsnotify_insert_event(group, fsn_event, fanotify_merge,
				    fanotify_insert_event);
	if (ret) {
		/* Permission events shouldn't be merged */
		BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
		/* Our event wasn't used in the end. Free it. */
		fsnotify_destroy_event(group, fsn_event);
		ret = 0;
	} else if (fanotify_is_perm_event(mask)) {
		ret = fanotify_get_response(group, FANOTIFY_PERM(event),
					    iter_info);
	}
finish:
	if (fanotify_is_perm_event(mask))
		fsnotify_finish_user_wait(iter_info);

	return ret;
}
static void fanotify_free_group_priv(struct fsnotify_group *group)
{
	put_user_ns(group->user_ns);
	kfree(group->fanotify_data.merge_hash);
	if (group->fanotify_data.ucounts)
		dec_ucount(group->fanotify_data.ucounts,
			   UCOUNT_FANOTIFY_GROUPS);

	if (mempool_initialized(&group->fanotify_data.error_events_pool))
		mempool_exit(&group->fanotify_data.error_events_pool);
}

static void fanotify_free_path_event(struct fanotify_event *event)
{
	path_put(fanotify_event_path(event));
	kmem_cache_free(fanotify_path_event_cachep, FANOTIFY_PE(event));
}

static void fanotify_free_perm_event(struct fanotify_event *event)
{
	path_put(fanotify_event_path(event));
	kmem_cache_free(fanotify_perm_event_cachep, FANOTIFY_PERM(event));
}

static void fanotify_free_fid_event(struct fanotify_event *event)
{
	struct fanotify_fid_event *ffe = FANOTIFY_FE(event);

	if (fanotify_fh_has_ext_buf(&ffe->object_fh))
		kfree(fanotify_fh_ext_buf(&ffe->object_fh));
	kmem_cache_free(fanotify_fid_event_cachep, ffe);
}

static void fanotify_free_name_event(struct fanotify_event *event)
{
	kfree(FANOTIFY_NE(event));
}

static void fanotify_free_error_event(struct fsnotify_group *group,
				      struct fanotify_event *event)
{
	struct fanotify_error_event *fee = FANOTIFY_EE(event);

	mempool_free(fee, &group->fanotify_data.error_events_pool);
}

static void fanotify_free_mnt_event(struct fanotify_event *event)
{
	kmem_cache_free(fanotify_mnt_event_cachep, FANOTIFY_ME(event));
}

static void fanotify_free_event(struct fsnotify_group *group,
				struct fsnotify_event *fsn_event)
{
	struct fanotify_event *event;

	event = FANOTIFY_E(fsn_event);
	put_pid(event->pid);
	switch (event->type) {
	case FANOTIFY_EVENT_TYPE_PATH:
		fanotify_free_path_event(event);
		break;
	case FANOTIFY_EVENT_TYPE_PATH_PERM:
		fanotify_free_perm_event(event);
		break;
	case FANOTIFY_EVENT_TYPE_FID:
		fanotify_free_fid_event(event);
		break;
	case FANOTIFY_EVENT_TYPE_FID_NAME:
		fanotify_free_name_event(event);
		break;
	case FANOTIFY_EVENT_TYPE_OVERFLOW:
		kfree(event);
		break;
	case FANOTIFY_EVENT_TYPE_FS_ERROR:
		fanotify_free_error_event(group, event);
		break;
	case FANOTIFY_EVENT_TYPE_MNT:
		fanotify_free_mnt_event(event);
		break;
	default:
		WARN_ON_ONCE(1);
	}
}

static void fanotify_freeing_mark(struct fsnotify_mark *mark,
				  struct fsnotify_group *group)
{
	if (!FAN_GROUP_FLAG(group, FAN_UNLIMITED_MARKS))
		dec_ucount(group->fanotify_data.ucounts, UCOUNT_FANOTIFY_MARKS);
}

static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
{
	kmem_cache_free(fanotify_mark_cache, FANOTIFY_MARK(fsn_mark));
}

const struct fsnotify_ops fanotify_fsnotify_ops = {
	.handle_event = fanotify_handle_event,
	.free_group_priv = fanotify_free_group_priv,
	.free_event = fanotify_free_event,
	.freeing_mark = fanotify_freeing_mark,
	.free_mark = fanotify_free_mark,
};
] |
| 20/2 | 2026/03/26 00:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 27028 output: 47 thoughts: 902 Thoughts: **Analyzing a Potential Data Race in fsnotify**
Okay, so I'm looking at this KCSAN report about a potential data race within the `fsnotify` code, specifically between `inotify_handle_inode_event` and `fsnotify_free_mark`. It appears the race is centered around the `inode_mark->flags` field. `inotify` seems to have this `IN_ONESHOT` flag, which is checked in `inotify_handle_inode_event`. The crucial point is that `fsnotify_free_mark` *clears* the `FSNOTIFY_MARK_FLAG_ALIVE` bit, and the code in question reads `FSNOTIFY_MARK_FLAG_IN_ONESHOT`, but I don't see any explicit locking around the access in `inotify_handle_inode_event`.
My immediate thought is that, while the bits are distinct (0x0001 and 0x0020) and naturally aligned 32-bit accesses do not tear on this 64-bit system, the Linux kernel memory model still treats unmarked concurrent accesses as a data race unless they are made atomic or protected by locks. So, KCSAN is technically correct in flagging it.
The interesting part is determining the impact. If the read sees `IN_ONESHOT` as 0 when it should be 1, it *might* miss calling `fsnotify_destroy_mark`, but since `fsnotify_free_mark` is already running, the mark is already in the process of being destroyed. That part seems benign. If it incorrectly sees `IN_ONESHOT` as 1 when it should be 0, it *will* call `fsnotify_destroy_mark`, but it then will encounter internal checks and bail. It seems like the mark might already be in the process of freeing. Again, benign. It still seems benign.
So, from a correctness standpoint, this race appears harmless. However, I know better than to ignore compiler optimizations. To make KCSAN happy (and avoid *potential* compiler weirdness), I'm leaning toward using either `READ_ONCE` in `inotify_handle_inode_event` to get the value, or potentially adding a lock, although I'd prefer `READ_ONCE`.
Wait a minute, `fsnotify_free_mark` *does* clear `ALIVE` under the mark lock! And `inotify_handle_inode_event` reads `flags` *without* the lock. Yep, that seems like a clear use case for `READ_ONCE`.
Now that I'm looking more closely, I just spotted similar potential races elsewhere in `fsnotify`. For instance, `fanotify_get_fsid` also reads `mark->flags` without a lock. Specifically:
```c
811: fsnotify_foreach_iter_mark_type(iter_info, mark, type) {
812: if (!(mark->flags & FSNOTIFY_MARK_FLAG_HAS_FSID))
813: continue;
```
I need to check `fsnotify_effective_ignore_mask` in `include/linux/fsnotify_backend.h` to make sure there are no other issues. It looks like it could race with `fsnotify_free_mark` too. Looks like there could be many spots that could race.
|
| 21/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
Args:map[ContextFile:include/linux/fsnotify_backend.h IncludeLines:true Name:fsnotify_effective_ignore_mask] Results: map[SourceCode: 824: /*
825: * Effective ignore mask taking into account if event victim is a
826: * directory and whether it is reported to a watching parent.
827: */
828: static inline __u32 fsnotify_effective_ignore_mask(struct fsnotify_mark *mark,
829: bool is_dir, int iter_type)
830: {
831: __u32 ignore_mask = fsnotify_ignored_events(mark);
832:
833: if (!ignore_mask)
834: return 0;
835:
836: /* For non-dir and non-child, no need to consult the event flags */
837: if (!is_dir && iter_type != FSNOTIFY_ITER_TYPE_PARENT)
838: return ignore_mask;
839:
840: ignore_mask = fsnotify_ignore_mask(mark);
841: if (!fsnotify_mask_applicable(ignore_mask, is_dir, iter_type))
842: return 0;
843:
844: return ignore_mask & ALL_FSNOTIFY_EVENTS;
845: }
SourceFile:include/linux/fsnotify_backend.h] |
| 22/2 | 2026/03/26 00:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 29087 output: 90 thoughts: 31 Thoughts: Okay, here's my interpretation of those thoughts, tailored for an expert audience: **Investigating Potential Data Corruption in File System Event Handling** Hmm, a potential issue is bubbling up. I'm starting to suspect that the `fsnotify_ignored_events` and `fsnotify_ignore_mask` helpers might be reading from the `flags` field without synchronization and not behaving as expected. Given the potential for subtle data corruption or unexpected behavior in our event processing, this warrants immediate investigation. We need to verify how these helpers actually interpret and apply those `flags` values. Let's dig in and check their implementations. |
| 23/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
Args:map[ContextFile:include/linux/fsnotify_backend.h IncludeLines:true Name:fsnotify_ignored_events] Results: map[SourceCode: 799: /* Legacy ignored_mask - only event types to ignore */
800: static inline __u32 fsnotify_ignored_events(struct fsnotify_mark *mark)
801: {
802: return mark->ignore_mask & ALL_FSNOTIFY_EVENTS;
803: }
SourceFile:include/linux/fsnotify_backend.h] |
| 24/2 | 2026/03/26 00:25 | tool | codesearch-definition-source |
Args:map[ContextFile:include/linux/fsnotify_backend.h IncludeLines:true Name:fsnotify_ignore_mask] Results: map[SourceCode: 771: /*
772: * Canonical "ignore mask" including event flags.
773: *
774: * Note the subtle semantic difference from the legacy ->ignored_mask.
775: * ->ignored_mask traditionally only meant which events should be ignored,
776: * while ->ignore_mask also includes flags regarding the type of objects on
777: * which events should be ignored.
778: */
779: static inline __u32 fsnotify_ignore_mask(struct fsnotify_mark *mark)
780: {
781: __u32 ignore_mask = mark->ignore_mask;
782:
783: /* The event flags in ignore mask take effect */
784: if (mark->flags & FSNOTIFY_MARK_FLAG_HAS_IGNORE_FLAGS)
785: return ignore_mask;
786:
787: /*
788: * Legacy behavior:
789: * - Always ignore events on dir
790: * - Ignore events on child if parent is watching children
791: */
792: ignore_mask |= FS_ISDIR;
793: ignore_mask &= ~FS_EVENT_ON_CHILD;
794: ignore_mask |= mark->mask & FS_EVENT_ON_CHILD;
795:
796: return ignore_mask;
797: }
SourceFile:include/linux/fsnotify_backend.h] |
| 25/2 | 2026/03/26 00:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 29884 output: 50 thoughts: 1203 Thoughts: **Analyzing a Data Race in the `fsnotify` Subsystem**
Okay, so I'm looking at this KCSAN report. It flags a potential data race between a write to `mark->flags` and a read from `mark->flags`. Specifically, the write happens in `fsnotify_free_mark` (line 584), where we clear the `FSNOTIFY_MARK_FLAG_ALIVE` bit. The read is in `inotify_handle_inode_event` (line 132), checking `FSNOTIFY_MARK_FLAG_IN_ONESHOT`.
My initial thought is that on x86, a 32-bit integer write *should* be atomic. Therefore, the read would either see the old value of `flags` (with `FSNOTIFY_MARK_FLAG_ALIVE` set) or the new value (with it cleared). In either case, the `FSNOTIFY_MARK_FLAG_IN_ONESHOT` bit should remain consistent.
However, a compiler optimization could potentially cause issues. The compiler might use smaller accesses, even though `flags` is a 32-bit `unsigned int`. The report says that the `value changed: 0x00000001 -> 0x00000000`, so `FSNOTIFY_MARK_FLAG_IN_ONESHOT` (0x0020) wasn't set. The code path *should* work if `IN_ONESHOT` isn't set.
But let's think it through more carefully. If `IN_ONESHOT` is *not* set, then the read at line 132 would return 0. The write at line 584 would change `flags` from 0x0001 to 0x0000. The read at line 132 would *still* return 0. The system would behave as intended.
If `IN_ONESHOT` *is* set, then the code *might* have a problem. In that case, `flags` would be 0x0021. If Thread A then clears bit 0, and Thread B then checks bit 5, Thread B could theoretically see an intermediate state. That seems unlikely with atomic 32-bit writes on x86.
So, let's look at the compiler optimizations. If the code is:
```c
if (flags & 0x0020) {
	...
}
```
If it's turned into
```c
tmp = flags;
if (tmp & 0x0020) {
	...
}
```
That's fine. The read of `flags` is safe.
But if it looks like this:
```c
if (flags & 0x0020) {
	do_something(flags);
}
```
And it reloads `flags` for `do_something(flags)`, we have another read of the flag.
But `fsnotify_destroy_mark` takes `inode_mark` and `group`, not `flags`.
Looking at the code more carefully, the `flags` value is not used after the check.
```c
132: if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
133: 	fsnotify_destroy_mark(inode_mark, group);
```
So it does seem benign. It's just a check of one bit in a flag.
Now, let's retrace the steps. The write comes from `fsnotify_free_mark`, called by `fsnotify_destroy_mark`, which is called by `__do_sys_inotify_rm_watch`. The read is in `inotify_handle_inode_event`. If `inotify_rm_watch` is called (to remove the watch) and an event happens concurrently, there's a race regardless. The event's reporting depends on whether the watch is removed *before* or *after* event processing.
Importantly, `inotify_handle_inode_event` checks `wd` first:
```c
87: wd = READ_ONCE(i_mark->wd);
88: if (wd == -1)
89: 	return 0;
```
`wd` is set to -1 when the mark is being destroyed, likely in `inotify_freeing_mark`. So there's a very fast fail.
So, in conclusion: It looks like KCSAN is reporting a data race that *is* technically there, but one that appears to be benign in practice.
|
| 26/2 | 2026/03/26 00:26 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/notify/inotify/inotify_fsnotify.c IncludeLines:true Name:inotify_freeing_mark] Results: map[SourceCode: 138: static void inotify_freeing_mark(struct fsnotify_mark *fsn_mark, struct fsnotify_group *group)
139: {
140: inotify_ignored_and_remove_idr(fsn_mark, group);
141: }
SourceFile:fs/notify/inotify/inotify_fsnotify.c] |
| 27/2 | 2026/03/26 00:26 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 32278 output: 53 thoughts: 17 Thoughts: Okay, here's my interpretation of that thought process, assuming I'm an expert in this field, presented as a first-person summary: **Digging into `inotify_ignored_and_remove_idr`** Alright, time to get my hands dirty. Let's see... the thought process is to investigate `inotify_ignored_and_remove_idr`. That's a key function, I can tell just by the name. It's likely involved in handling ignored events related to inotify and managing entries within the IDR (the kernel's radix-tree-based integer-ID allocator). My hunch is it's responsible for the cleanup side of things: removing entries from the IDR that are no longer relevant, potentially after an event has been ignored. It's got "remove" right there in the function name, so that's a clue. I need to understand exactly what triggers an event to be considered "ignored". Is it related to filtering? Expiry? Resource limits? Knowing this will help me contextualize the removal process, any potential concurrency implications, and any potential for resource leaks or vulnerabilities. Right, let's dive in. I need to trace its call sites, understand the input parameters, and see how it interacts with other related data structures. This is going to be crucial for figuring out how the inotify subsystem handles these ignored events. |
| 28/2 | 2026/03/26 00:26 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/notify/inotify/inotify_user.c IncludeLines:true Name:inotify_ignored_and_remove_idr] Results: map[SourceCode: 517: /*
518: * Send IN_IGNORED for this wd, remove this wd from the idr.
519: */
520: void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
521: struct fsnotify_group *group)
522: {
523: struct inotify_inode_mark *i_mark;
524:
525: /* Queue ignore event for the watch */
526: inotify_handle_inode_event(fsn_mark, FS_IN_IGNORED, NULL, NULL, NULL,
527: 0);
528:
529: i_mark = container_of(fsn_mark, struct inotify_inode_mark, fsn_mark);
530: /* remove this mark from the idr */
531: inotify_remove_from_idr(group, i_mark);
532:
533: dec_inotify_watches(group->inotify_data.ucounts);
534: }
SourceFile:fs/notify/inotify/inotify_user.c] |
| 29/2 | 2026/03/26 00:26 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 32973 output: 51 thoughts: 12 Thoughts: Okay, here's my interpretation of that thought process, tailored for an expert audience, and written in the first person: **Navigating the Inotify IDR** Right, let me see... I'm thinking about the `inotify_remove_from_idr` function. This is key, obviously, to handling the cleanup of inotify watches. My mind's automatically jumping through the considerations here. We're talking about removing an entry, presumably a watch structure, from the IDR (the kernel's integer-ID allocator). So, what are the implicit assumptions? We've already ensured that the watch is no longer needed, that the related file descriptor is being closed, or the file itself is going away. We've likely acquired the necessary locks -- probably the inotify mutex or some other relevant lock to ensure atomicity. The function has got to handle the potential race conditions, right? We have to ensure we're not removing something that's still actively in use, or that another thread hasn't already snatched the entry. There's probably careful checking of the refcount or some other mechanism within the IDR itself, to ensure the entry hasn't been deallocated or modified in a way that would cause problems. It's a critical section, and any failure would be catastrophic, leading to either resource leaks or, worse, use-after-free scenarios. I'm considering the potential impact on other inotify operations. Removing an entry could trigger notifications regarding file deletions or changes depending on the configuration. There's the bookkeeping related to the IDR itself – making sure the data structures are updated, the ID is freed appropriately, and that no memory leaks occur. What kind of error handling is implemented? What are the potential consequences of errors within this function?
I am wondering if the return value of this function, if any, is being checked for error handling, or the IDR itself may have another operation to handle the error code. It's a delicate dance, really. Every step needs to be meticulously orchestrated to maintain the integrity of the inotify subsystem. The IDR is a fundamental part of the design, and this function is the linchpin for proper cleanup. We can't mess this up. |
| 30/2 | 2026/03/26 00:26 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/notify/inotify/inotify_user.c IncludeLines:true Name:inotify_remove_from_idr] Results: map[SourceCode: 448: /*
449: * Remove the mark from the idr (if present) and drop the reference
450: * on the mark because it was in the idr.
451: */
452: static void inotify_remove_from_idr(struct fsnotify_group *group,
453: struct inotify_inode_mark *i_mark)
454: {
455: struct idr *idr = &group->inotify_data.idr;
456: spinlock_t *idr_lock = &group->inotify_data.idr_lock;
457: struct inotify_inode_mark *found_i_mark = NULL;
458: int wd;
459:
460: spin_lock(idr_lock);
461: wd = i_mark->wd;
462:
463: /*
464: * does this i_mark think it is in the idr? we shouldn't get called
465: * if it wasn't....
466: */
467: if (wd == -1) {
468: WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
469: __func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
470: goto out;
471: }
472:
473: /* Lets look in the idr to see if we find it */
474: found_i_mark = inotify_idr_find_locked(group, wd);
475: if (unlikely(!found_i_mark)) {
476: WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
477: __func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
478: goto out;
479: }
480:
481: /*
482: * We found an mark in the idr at the right wd, but it's
483: * not the mark we were told to remove. eparis seriously
484: * fucked up somewhere.
485: */
486: if (unlikely(found_i_mark != i_mark)) {
487: WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p "
488: "found_i_mark=%p found_i_mark->wd=%d "
489: "found_i_mark->group=%p\n", __func__, i_mark,
490: i_mark->wd, i_mark->fsn_mark.group, found_i_mark,
491: found_i_mark->wd, found_i_mark->fsn_mark.group);
492: goto out;
493: }
494:
495: /*
496: * One ref for being in the idr
497: * one ref grabbed by inotify_idr_find
498: */
499: if (unlikely(refcount_read(&i_mark->fsn_mark.refcnt) < 2)) {
500: printk(KERN_ERR "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
501: __func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
502: /* we can't really recover with bad ref cnting.. */
503: BUG();
504: }
505:
506: idr_remove(idr, wd);
507: /* Removed from the idr, drop that ref. */
508: fsnotify_put_mark(&i_mark->fsn_mark);
509: out:
510: i_mark->wd = -1;
511: spin_unlock(idr_lock);
512: /* match the ref taken by inotify_idr_find_locked() */
513: if (found_i_mark)
514: fsnotify_put_mark(&found_i_mark->fsn_mark);
515: }
SourceFile:fs/notify/inotify/inotify_user.c] |
| 31/2 | 2026/03/26 00:26 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 34760 output: 27 thoughts: 896 Thoughts: **Analyzing a Potential Race Condition in the Inotify Implementation** Okay, so I'm looking at this sequence and thinking about a possible race condition. I see that `i_mark->wd = -1` gets set at line 510, which is reached through the `inotify_freeing_mark` function, itself called from `fsnotify_free_mark`. This happens after the `mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE;` operation, which is the key thing here. The order of operations is: first, the `ALIVE` flag is cleared, then the `freeing_mark` callback is called, which eventually leads to setting the watch descriptor to `-1`. Now, in `inotify_handle_inode_event`, at lines 87-89, the code reads the `wd` field. If `wd` is -1, it returns early. The crucial point is at lines 132-133: if the `IN_ONESHOT` flag is set, then `fsnotify_destroy_mark` is called. The logic is that if `wd` is not -1, `ALIVE` might be set or just cleared. If `ALIVE` is cleared, we're in the `fsnotify_free_mark` function, which is the area where the race condition might exist. The race is between the write to `mark->flags` and the read of `mark->flags`. The timing here is key. If `inotify_handle_inode_event` is triggered, it's because an event has occurred. If the mark is being destroyed, we might still want to report the event, especially if it happened before the mark was detached. The race on the `flags` is just a bitwise check. I need to make sure the compiler hasn't reordered the code. If it had reordered the write to `flags` and the call to `freeing_mark`, that could be a problem, but that's not possible: the critical section clears `mark->flags` under `mark->lock`, releases the lock, and only then calls `freeing_mark`. So, the race is definitely there, but is it harmful? I can't immediately see how it would lead to a crash or memory corruption.
The worst case is that `fsnotify_destroy_mark` is called multiple times, but the function's supposed to handle that gracefully. Alternatively, it might not be called when it should be, but since the mark is being destroyed, it's ok. I think I'll be fine. Now, wait a minute, what if `IN_ONESHOT` is cleared? That's when `fsnotify_destroy_mark` is not called. `IN_ONESHOT` is only cleared by the kernel when the mark is destroyed. It's set by the user, and it's a part of the `mask`. It looks like `FSNOTIFY_MARK_FLAG_IN_ONESHOT` is a flag in `mark->flags`. I need to review its uses. |
| 32/2 | 2026/03/26 00:26 | tool | grepper |
Args:map[Expression:FSNOTIFY_MARK_FLAG_IN_ONESHOT] Results: map[Output:fs/notify/inotify/inotify.h=32=static inline __u32 inotify_mark_user_mask(struct fsnotify_mark *fsn_mark)
--
fs/notify/inotify/inotify.h-37- mask |= IN_EXCL_UNLINK;
fs/notify/inotify/inotify.h:38: if (fsn_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
fs/notify/inotify/inotify.h-39- mask |= IN_ONESHOT;
--
fs/notify/inotify/inotify_fsnotify.c=59=int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
--
fs/notify/inotify/inotify_fsnotify.c-131-
fs/notify/inotify/inotify_fsnotify.c:132: if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
fs/notify/inotify/inotify_fsnotify.c-133- fsnotify_destroy_mark(inode_mark, group);
--
fs/notify/inotify/inotify_user.c=99=static inline __u32 inotify_arg_to_mask(struct inode *inode, u32 arg)
--
fs/notify/inotify/inotify_user.c-117-#define INOTIFY_MARK_FLAGS \
fs/notify/inotify/inotify_user.c:118: (FSNOTIFY_MARK_FLAG_EXCL_UNLINK | FSNOTIFY_MARK_FLAG_IN_ONESHOT)
fs/notify/inotify/inotify_user.c-119-
fs/notify/inotify/inotify_user.c=120=static inline unsigned int inotify_arg_to_flags(u32 arg)
--
fs/notify/inotify/inotify_user.c-126- if (arg & IN_ONESHOT)
fs/notify/inotify/inotify_user.c:127: flags |= FSNOTIFY_MARK_FLAG_IN_ONESHOT;
fs/notify/inotify/inotify_user.c-128-
--
include/linux/fsnotify_backend.h=607=struct fsnotify_mark {
--
include/linux/fsnotify_backend.h-632-#define FSNOTIFY_MARK_FLAG_EXCL_UNLINK 0x0010
include/linux/fsnotify_backend.h:633:#define FSNOTIFY_MARK_FLAG_IN_ONESHOT 0x0020
include/linux/fsnotify_backend.h-634- /* fanotify mark flags */
] |
| 33/2 | 2026/03/26 00:26 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 37051 output: 19 thoughts: 4002 Thoughts: **Benign Data Race in fsnotify: A Deep Dive** Okay, so KCSAN flagged a data race. Let's break this down systematically, as any good kernel developer would. We're looking at a race condition involving `mark->flags` within the filesystem notification framework, specifically between `fsnotify_free_mark` and `inotify_handle_inode_event`. The core issue appears to center around the `FSNOTIFY_MARK_FLAG_ALIVE` and `FSNOTIFY_MARK_FLAG_IN_ONESHOT` flags. The first is cleared during mark destruction, while the latter, crucially, *isn't*. The read happens in `inotify_handle_inode_event`, and the write in `fsnotify_free_mark`. Initially, my concern was the possibility of a 32-bit read seeing a partially written 32-bit value on x86, somehow losing the `IN_ONESHOT` flag, but that's impossible: checking the code confirms the `flags` field is an `unsigned int` (32 bits), and a naturally aligned 32-bit access does not tear. The report also shows `value changed: 0x00000001 -> 0x00000000`. This means `IN_ONESHOT` was never set on this mark. If it *wasn't* set, the read returns 0 anyway, so why the report? It's still a data race, even if benign. The compiler could reorder. A key question is whether a oneshot mark *can* be destroyed by *both* `inotify_rm_watch` and `inotify_handle_inode_event`. Yes it can; those paths are possible, so the question is, can they break things? The good news is that `inotify_handle_inode_event` calls `fsnotify_destroy_mark` on `IN_ONESHOT`, but `fsnotify_destroy_mark` is safe to call repeatedly. It checks for `FSNOTIFY_MARK_FLAG_ATTACHED` and `FSNOTIFY_MARK_FLAG_ALIVE`, so it returns immediately if the mark is already being destroyed. The memory itself is protected too. The `mark` is an SRCU protected memory block, ensuring the memory backing `inode_mark` is still valid. Further inspection reveals that the "race" is on the same 32-bit word, with different bits being modified.
While concurrent modification of a bitfield is a data race, in this case, all modifications are under `mark->lock`, *except* the read in `inotify_handle_inode_event`. The read is outside the lock. However, there are no other writes without the lock. It is important to determine whether `fsnotify_destroy_mark` could be called multiple times for a single mark. `fsnotify_destroy_mark` calls both `fsnotify_detach_mark` and `fsnotify_free_mark`, but *both* functions contain checks for `ATTACHED` and `ALIVE` under `mark->lock`. Thus, only one of the threads will win in the case of multiple calls to those functions. The key observation is that the read at line 132 in `inotify_handle_inode_event` *only* determines whether `fsnotify_destroy_mark` is called. It does not lead to any memory corruption or other harmful behavior. The flag `FSNOTIFY_MARK_FLAG_ATTACHED` is also cleared in `fsnotify_detach_mark`, but that function is called *before* `fsnotify_free_mark`. This ensures that even if `inotify_handle_inode_event` sees `IN_ONESHOT` and calls `fsnotify_destroy_mark`, the destruction process is idempotent. We also see that `fsnotify_destroy_mark` protects the group's mark list. So, it is safe even if multiple threads are calling it. The real key is that the value change reported (`1 -> 0`) means `ALIVE` was set, and is being cleared. In this instance, the `IN_ONESHOT` check at line 132 returns 0 both before and after the write, since that bit was never set. Even if `IN_ONESHOT` *were* set, the check, and the call to `fsnotify_destroy_mark`, would still be correct. In short, even though this *is* a data race (different bits modified on the same word), it's benign because the code is structured in a way that handles those parallel reads and writes without introducing any crash. There will be no change in behavior due to the read, even in the "worst case". `wd` is handled properly by `READ_ONCE()`. |
| 34/2 | 2026/03/26 00:27 | tool | set-results |
Args:map[Benign:true Confident:true] Results: map[Benign:true Confident:true] |
| 35/2 | 2026/03/26 00:27 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 42114 output: 797 thoughts: 370 Thoughts: **My Assessment of the Data Race** Okay, the analysis is done, and here's what I've got. The data race is definitely focused on `inode_mark->flags`. The write operation occurs within `fsnotify_free_mark` when it clears `FSNOTIFY_MARK_FLAG_ALIVE`. The read happens in `inotify_handle_inode_event` where it's checking the `FSNOTIFY_MARK_FLAG_IN_ONESHOT` bit. Both operations target different bits within the same `unsigned int`. The write is guarded by `mark->lock`, which is good. But the read... that's the problem; it isn't protected. However, after careful consideration, I'm comfortable saying this is a *benign* race. Here's why:
1. On a 64-bit system, atomic 32-bit accesses mean the distinct bits don't interfere with each other, so we don't need to be concerned about corruption.
2. `FSNOTIFY_MARK_FLAG_IN_ONESHOT` is set at creation time and is never cleared, so the value the read observes for that bit is always stable.
3. The `fsnotify_destroy_mark` function is designed to be called multiple times for the same mark, so the race itself won't cause problems with memory management.
4. The SRCU mechanism keeps the mark alive during event handling, adding another layer of safety.
5. Critically, this race doesn't lead to memory corruption or crashes. That's a huge win.
The report highlights a `value changed: 0x00000001 -> 0x00000000`, so we're seeing `ALIVE` getting cleared while `IN_ONESHOT` was initially 0. In this specific scenario, the check `flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT` would evaluate to 0 both before and after the write. Alternatively, if `IN_ONESHOT` had initially been 1, the check would also return true both before and after the write. The overall behaviour isn't impacted by this. I'm confident in this conclusion. |
| 36/1 | 2026/03/26 00:27 | action | format-explanation |
Results:map[Explanation:The data race occurs on the `flags` field of `struct fsnotify_mark`.
Specifically, it is a race between a write to clear the
`FSNOTIFY_MARK_FLAG_ALIVE` bit (bit 0) in `fsnotify_free_mark()` and a read of
the `FSNOTIFY_MARK_FLAG_IN_ONESHOT` bit (bit 5) in
`inotify_handle_inode_event()`.
### Analysis of the Race
1. **Write Access**: In `fsnotify_free_mark()`, the code clears the
`FSNOTIFY_MARK_FLAG_ALIVE` flag under the protection of `mark->lock`:
```c
spin_lock(&mark->lock);
...
mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE;
spin_unlock(&mark->lock);
```
This is called during mark destruction (e.g., via `inotify_rm_watch()`).
2. **Read Access**: In `inotify_handle_inode_event()`, the code checks if the
mark is a "oneshot" mark to decide whether to destroy it after the event is
handled:
```c
if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
	fsnotify_destroy_mark(inode_mark, group);
```
This read is performed without holding `mark->lock`.
### Why the Race is Benign
* **Different Bits**: The bits being accessed are different (`0x0001` vs
`0x0020`). Aligned 32-bit `unsigned int` accesses are atomic at the word level
on the architectures Linux supports, so a write to one bit will not cause a
concurrent read of another bit in the same word to return an incorrect value,
provided the compiler does not perform unusual optimizations.
* **Flag Stability**: The `FSNOTIFY_MARK_FLAG_IN_ONESHOT` flag is set when the
inotify watch is created and is never cleared during the lifetime of the mark.
Therefore, the read in `inotify_handle_inode_event()` will always see the
correct value for this bit, regardless of whether `FSNOTIFY_MARK_FLAG_ALIVE` is
being cleared simultaneously.
* **Idempotency of Destruction**: If the race were to cause
`inotify_handle_inode_event()` to call `fsnotify_destroy_mark()` while another
thread is already destroying the mark, it is safe. `fsnotify_destroy_mark()` and
its sub-functions (`fsnotify_detach_mark()` and `fsnotify_free_mark()`) contain
internal checks (protected by locks) to ensure that detachment and freeing only
happen once.
* **Object Lifetime**: The `fsnotify_mark` object is kept alive during event
handling by SRCU (and potentially a reference count), so the memory access
itself is safe from use-after-free.
* **No Functional Impact**: As shown in the KCSAN report (`value changed:
0x00000001 -> 0x00000000`), the `IN_ONESHOT` bit was not even set in this
instance. Even if it were, the logic would remain correct as the `IN_ONESHOT`
bit remains stable.
In conclusion, this is a benign data race on a flags field that does not lead to
memory corruption, crashes, or incorrect logic. While using
`READ_ONCE()`/`WRITE_ONCE()` or `data_race()` would satisfy KCSAN, the current
code is functionally correct.