syzbot |
| ID | Workflow | Result | Correct | Bug | Created | Started | Finished | Revision | Error |
|---|---|---|---|---|---|---|---|---|---|
| f45266b7-2b9d-40ce-a33c-eaa16a2928af | assessment-kcsan | Benign: ✅ Confident: ✅ | ❓ | KCSAN: data-race in xas_clear_mark / xas_find_marked (2) | 2026/02/24 23:23 | 2026/02/24 23:23 | 2026/02/24 23:26 | 305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9 |
The data race occurs between `xas_clear_mark()` and `xas_find_marked()` in `lib/xarray.c`.
### Analysis of the Race
1. **Writer Side (`xas_clear_mark`)**:
* The writer is executing `__folio_start_writeback()`, which calls `xas_lock_irqsave()` to acquire the XArray spinlock before calling `xas_clear_mark()`.
* `xas_clear_mark()` calls `node_clear_mark()`, which uses `__test_and_clear_bit()` (it appears in the report as the instrumented inline `___test_and_clear_bit()`). This is a non-atomic bitwise operation (on x86, it translates to a `btr` instruction without the `LOCK` prefix).
* The writer holds the spinlock, so it is the only writer to this XArray node at the time.
2. **Reader Side (`xas_find_marked`)**:
* The reader is executing `filemap_get_folios_tag()`, which uses `rcu_read_lock()` for lockless traversal of the page cache.
* `xas_find_marked()` calls `xas_find_chunk()`, which performs a plain load of the mark bitmap: `unsigned long data = *addr & (~0UL << offset);`.
* Since the reader does not hold the spinlock, it can race with the writer.
### Why the Race is Benign
* **Atomicity on x86_64**: On x86_64 (where this was reported), naturally aligned 64-bit loads and stores are atomic. Although `btr` is a read-modify-write instruction, its write-back to memory is a single 64-bit store, so a reader on another CPU sees either the value before the bit was cleared or the value after, never a "torn" value in which only some bits are updated. This assumes the compiler emits the plain load in `xas_find_chunk()` as a single access, which is the conservative-compiler assumption used for this assessment.
* **Robustness of XArray Logic**: The XArray is designed for RCU-protected lookups. If `xas_find_marked()` sees the old value (bit set), it returns the entry. If it sees the new value (bit cleared), it skips it. Both are valid states during a concurrent clear operation.
* **Higher-Level Verification**: In the case of page cache writeback (the context of this race), the caller `mpage_prepare_extent_to_map()` locks the folio returned by `filemap_get_folios_tag()` and re-checks its state (e.g., `folio_test_dirty()`, `folio_test_writeback()`). If the folio's state changed during the race, the higher-level logic handles it correctly (e.g., by skipping the folio if it's already under writeback).
* **No Risk of Corruption**: The race is on a bitmap used for marking/tagging, not on pointers or structural tree links. A reader might briefly act on a stale mark (one that is being cleared) or skip an entry whose mark was just cleared, but neither outcome can lead to memory corruption or a crash.
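The writer/reader pattern described above can be modelled in user space. The following is a minimal sketch, not kernel code: `model_clear_mark()`, `model_find_marked()` and `MODEL_CHUNK_BITS` are hypothetical names, the tree lock is assumed to be held by the caller of the writer, and a relaxed C11 atomic load stands in for what `READ_ONCE()` would provide in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CHUNK_BITS 64u /* hypothetical: one 64-bit mark word per node */

/*
 * Writer side. The caller is assumed to hold the (hypothetical) tree lock,
 * so there is exactly one writer at a time. The clear is deliberately split
 * into a separate load and store to model a non-atomic read-modify-write.
 * Returns true if the bit was previously set.
 */
static bool model_clear_mark(_Atomic uint64_t *marks, unsigned int offset)
{
	assert(offset < MODEL_CHUNK_BITS);
	uint64_t bit = UINT64_C(1) << offset;
	uint64_t old = atomic_load_explicit(marks, memory_order_relaxed);

	atomic_store_explicit(marks, old & ~bit, memory_order_relaxed);
	return (old & bit) != 0;
}

/*
 * Reader side, no lock held. One relaxed load takes a snapshot of the whole
 * word, then the search runs on that snapshot only. Returns the first marked
 * offset that is >= offset, or MODEL_CHUNK_BITS if there is none.
 */
static unsigned int model_find_marked(_Atomic uint64_t *marks,
				      unsigned int offset)
{
	if (offset >= MODEL_CHUNK_BITS)
		return MODEL_CHUNK_BITS;

	uint64_t data = atomic_load_explicit(marks, memory_order_relaxed) &
			(~UINT64_C(0) << offset);

	for (unsigned int i = offset; i < MODEL_CHUNK_BITS; i++)
		if (data & (UINT64_C(1) << i))
			return i;
	return MODEL_CHUNK_BITS;
}
```

Because the reader works on a single snapshot, it observes either the old or the new bitmap. Whichever it sees, the caller still has to re-validate the entry it gets back, which is the role the folio lock and state re-checks play in the analysis above.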
In summary, this is a classic example of an intentional data race in a lockless data structure where the implementation relies on architecture-specific atomicity of aligned word accesses and robust higher-level logic to handle transient inconsistencies.
Final Verdict: Benign.

EXT4-fs (loop9): encrypted files will use data=ordered instead of data journaling mode
EXT4-fs (loop9): 1 truncate cleaned up
EXT4-fs (loop9): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KCSAN: data-race in xas_clear_mark / xas_find_marked

read-write to 0xffff8881077576c0 of 8 bytes by task 13171 on cpu 1:
 instrument_read_write include/linux/instrumented.h:56 [inline]
 __instrument_read_write_bitop include/asm-generic/bitops/instrumented-non-atomic.h:84 [inline]
 ___test_and_clear_bit include/asm-generic/bitops/instrumented-non-atomic.h:114 [inline]
 node_clear_mark lib/xarray.c:102 [inline]
 xas_clear_mark+0x91/0x180 lib/xarray.c:922
 __folio_start_writeback+0x270/0x370 mm/page-writeback.c:3011
 ext4_bio_write_folio+0x5ad/0x9f0 fs/ext4/page-io.c:584
 mpage_submit_folio fs/ext4/inode.c:2080 [inline]
 mpage_process_page_bufs+0x4a1/0x620 fs/ext4/inode.c:2191
 mpage_prepare_extent_to_map+0x7d4/0xc50 fs/ext4/inode.c:2730
 ext4_do_writepages+0x70f/0x2840 fs/ext4/inode.c:2871
 ext4_writepages+0x18f/0x320 fs/ext4/inode.c:3019
 do_writepages+0x1c6/0x310 mm/page-writeback.c:2554
 filemap_writeback mm/filemap.c:387 [inline]
 filemap_fdatawrite_range mm/filemap.c:412 [inline]
 filemap_write_and_wait_range+0x166/0x370 mm/filemap.c:684
 filemap_write_and_wait include/linux/pagemap.h:65 [inline]
 ext4_ioctl_setflags fs/ext4/ioctl.c:648 [inline]
 ext4_fileattr_set+0x4e6/0xe40 fs/ext4/ioctl.c:1027
 vfs_fileattr_set+0x666/0x6b0 fs/file_attr.c:301
 ioctl_setflags+0x14b/0x180 fs/file_attr.c:337
 do_vfs_ioctl+0x661/0xe70 fs/ioctl.c:560
 __do_sys_ioctl fs/ioctl.c:595 [inline]
 __se_sys_ioctl+0x82/0x140 fs/ioctl.c:583
 __x64_sys_ioctl+0x43/0x50 fs/ioctl.c:583
 x64_sys_call+0x1563/0x3020 arch/x86/include/generated/asm/syscalls_64.h:17
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff8881077576c0 of 8 bytes by task 13164 on cpu 0:
 xas_find_chunk include/linux/xarray.h:1752 [inline]
 xas_find_marked+0x213/0x620 lib/xarray.c:1510
 find_get_entry mm/filemap.c:2064 [inline]
 filemap_get_folios_tag+0xfa/0x510 mm/filemap.c:2332
 mpage_prepare_extent_to_map+0x328/0xc50 fs/ext4/inode.c:2632
 ext4_do_writepages+0x70f/0x2840 fs/ext4/inode.c:2871
 ext4_writepages+0x18f/0x320 fs/ext4/inode.c:3019
 do_writepages+0x1c6/0x310 mm/page-writeback.c:2554
 filemap_writeback mm/filemap.c:387 [inline]
 filemap_fdatawrite_range mm/filemap.c:412 [inline]
 file_write_and_wait_range+0x178/0x2f0 mm/filemap.c:786
 generic_buffers_fsync_noflush+0x45/0x130 fs/buffer.c:607
 ext4_fsync_nojournal fs/ext4/fsync.c:88 [inline]
 ext4_sync_file+0x1aa/0x6b0 fs/ext4/fsync.c:147
 vfs_fsync_range+0xc5/0xe0 fs/sync.c:186
 generic_write_sync include/linux/fs.h:2640 [inline]
 ext4_buffered_write_iter+0x34f/0x3c0 fs/ext4/file.c:306
 ext4_file_write_iter+0x380/0xfa0 fs/ext4/file.c:-1
 new_sync_write fs/read_write.c:595 [inline]
 vfs_write+0x5a6/0x9f0 fs/read_write.c:688
 ksys_write+0xdc/0x1a0 fs/read_write.c:740
 __do_sys_write fs/read_write.c:751 [inline]
 __se_sys_write fs/read_write.c:748 [inline]
 __x64_sys_write+0x40/0x50 fs/read_write.c:748
 x64_sys_call+0x27e1/0x3020 arch/x86/include/generated/asm/syscalls_64.h:2
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x003fffff00000000 -> 0x003ffff000000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 13164 Comm: syz.9.13498 Tainted: G W syzkaller #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
==================================================================
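The `value changed` line in the report can be decoded mechanically. The sketch below uses hypothetical helper names (`cleared_bits()`, `count_bits()`, `lowest_bit()`) to work out which mark bits differ between the two observed values. For the values in this report it yields four cleared bits, 32 through 35. Since `__test_and_clear_bit()` clears one bit per call, this presumably means several clears landed between the two observations, although the report itself does not say so.

```c
#include <stdint.h>

/* Bits that were set in old_val and are clear in new_val. */
static uint64_t cleared_bits(uint64_t old_val, uint64_t new_val)
{
	return old_val & ~new_val;
}

/* Number of set bits in v (portable loop, no compiler builtins). */
static unsigned int count_bits(uint64_t v)
{
	unsigned int n = 0;

	for (; v != 0; v &= v - 1) /* each step drops the lowest set bit */
		n++;
	return n;
}

/* Index of the lowest set bit in v, or 64 if v is zero. */
static unsigned int lowest_bit(uint64_t v)
{
	for (unsigned int i = 0; i < 64; i++)
		if (v & (UINT64_C(1) << i))
			return i;
	return 64;
}
```

Calling `cleared_bits(0x003fffff00000000, 0x003ffff000000000)` gives the mask `0x0000000f00000000`. Swapping the arguments gives zero, which confirms that no bit was set between the two observations.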
| Seq | Timestamp | Type | Name | Duration |
|---|---|---|---|---|
| 0/0 | 2026/02/24 23:23 | flow | assessment-kcsan |
2mResults:map[Benign:true Confident:true Explanation:The data race occurs between `xas_clear_mark()` and `xas_find_marked()` in `lib/xarray.c`.
### Analysis of the Race
1. **Writer Side (`xas_clear_mark`)**:
* The writer is executing `__folio_start_writeback()`, which calls `xas_lock_irqsave()` to acquire the XArray spinlock before calling `xas_clear_mark()`.
* `xas_clear_mark()` calls `node_clear_mark()`, which uses `___test_and_clear_bit()`. This is a non-atomic bitwise operation (on x86, it translates to a `btr` instruction without the `LOCK` prefix).
* The writer holds the spinlock, so it is the only writer to this XArray node at the time.
2. **Reader Side (`xas_find_marked`)**:
* The reader is executing `filemap_get_folios_tag()`, which uses `rcu_read_lock()` for lockless traversal of the page cache.
* `xas_find_marked()` calls `xas_find_chunk()`, which performs a plain load of the mark bitmap: `unsigned long data = *addr & (~0UL << offset);`.
* Since the reader does not hold the spinlock, it can race with the writer.
### Why the Race is Benign
* **Atomicity on x86_64**: On the x86_64 architecture (where this was reported), naturally aligned 64-bit loads and stores are atomic. Although `btr` is a read-modify-write instruction, the final write back to memory is a single 64-bit store. A reader on another CPU will see either the value before the bit was cleared or the value after. It will not see a "torn" value where only some bits are updated.
* **Robustness of XArray Logic**: The XArray is designed for RCU-protected lookups. If `xas_find_marked()` sees the old value (bit set), it returns the entry. If it sees the new value (bit cleared), it skips it. Both are valid states during a concurrent clear operation.
* **Higher-Level Verification**: In the case of page cache writeback (the context of this race), the caller `mpage_prepare_extent_to_map()` locks the folio returned by `filemap_get_folios_tag()` and re-checks its state (e.g., `folio_test_dirty()`, `folio_test_writeback()`). If the folio's state changed during the race, the higher-level logic handles it correctly (e.g., by skipping the folio if it's already under writeback).
* **No Risk of Corruption**: The race is on a bitmap used for marking/tagging, not on pointers or structural tree links. A race here might cause a reader to briefly see a stale mark or miss a mark being set, but it cannot lead to memory corruption or crashes.
In summary, this is a classic example of an intentional data race in a lockless data structure where the implementation relies on architecture-specific atomicity of aligned word accesses and robust higher-level logic to handle transient inconsistencies.
Final Verdict: Benign.] |
| 1/1 | 2026/02/24 23:23 | action | kernel-checkouter |
0mResults:map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/ae13d14ba43c9fbd1af73b1b694531ed5ba46e0b] |
| 2/1 | 2026/02/24 23:23 | action | kernel-builder |
0mResults:map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/87d68aea78e227fd8bb253ce6d589d22d1c91020] |
| 3/1 | 2026/02/24 23:23 | action | codesearch-prepare |
0mResults:map[Index:codesearch-index] |
| 4/1 | 2026/02/24 23:23 | agent | expert |
2mModel:gemini-3-flash-preview Results: map[Benign:true Confident:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel data race is benign or not. The data race report is from KCSAN tool. It contains 2 stack traces of the memory accesses that constitute a data race. A "benign" data races are on a simple int/bool variable or similar field, and the accesses are not supposed to be protected by any mutual exclusion primitives. Common examples of such "benign" data races are accesses to various flags fields, statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A non-benign (or "harmful" data race) can lead to corruption/crash even with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A common example of a "harmful" data races is race on a complex container (list/hashmap/etc), where accesses are supposed to be protected by a mutual exclusion primitive. Also consider races that happen at the same time with the given one. If there is no synchronization in between, other memory accesses in the involved threads race with each other if they access the same memory. For example, if both threads execute: some_struct->have_elements = true; list_add(new_node, &some_struct->list_head); the race on some_struct->have_elements may appear benign, however it also implies there is a race on some_struct->list_head which is not benign, since the list is not thread-safe. Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses, and thus even with a conservative compiler may not be fully atomic. However, such races may still be benign depending on how writes are done, and how read data is used. In the final reply explain why you think the given data race is benign or is harmful. 
Use the provided tools to confirm any assumptions, variables/fields being accessed, etc. In particular, don't make assumptions about the kernel source code, use codesearch tools to read the actual source code. Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool. Prompt: The data race report is: EXT4-fs (loop9): encrypted files will use data=ordered instead of data journaling mode EXT4-fs (loop9): 1 truncate cleaned up EXT4-fs (loop9): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none. ================================================================== BUG: KCSAN: data-race in xas_clear_mark / xas_find_marked read-write to 0xffff8881077576c0 of 8 bytes by task 13171 on cpu 1: instrument_read_write include/linux/instrumented.h:56 [inline] __instrument_read_write_bitop include/asm-generic/bitops/instrumented-non-atomic.h:84 [inline] ___test_and_clear_bit include/asm-generic/bitops/instrumented-non-atomic.h:114 [inline] node_clear_mark lib/xarray.c:102 [inline] xas_clear_mark+0x91/0x180 lib/xarray.c:922 __folio_start_writeback+0x270/0x370 mm/page-writeback.c:3011 ext4_bio_write_folio+0x5ad/0x9f0 fs/ext4/page-io.c:584 mpage_submit_folio fs/ext4/inode.c:2080 [inline] mpage_process_page_bufs+0x4a1/0x620 fs/ext4/inode.c:2191 mpage_prepare_extent_to_map+0x7d4/0xc50 fs/ext4/inode.c:2730 ext4_do_writepages+0x70f/0x2840 fs/ext4/inode.c:2871 ext4_writepages+0x18f/0x320 fs/ext4/inode.c:3019 do_writepages+0x1c6/0x310 mm/page-writeback.c:2554 filemap_writeback mm/filemap.c:387 [inline] filemap_fdatawrite_range mm/filemap.c:412 [inline] filemap_write_and_wait_range+0x166/0x370 mm/filemap.c:684 filemap_write_and_wait include/linux/pagemap.h:65 [inline] ext4_ioctl_setflags fs/ext4/ioctl.c:648 [inline] ext4_fileattr_set+0x4e6/0xe40 fs/ext4/ioctl.c:1027 vfs_fileattr_set+0x666/0x6b0 
fs/file_attr.c:301 ioctl_setflags+0x14b/0x180 fs/file_attr.c:337 do_vfs_ioctl+0x661/0xe70 fs/ioctl.c:560 __do_sys_ioctl fs/ioctl.c:595 [inline] __se_sys_ioctl+0x82/0x140 fs/ioctl.c:583 __x64_sys_ioctl+0x43/0x50 fs/ioctl.c:583 x64_sys_call+0x1563/0x3020 arch/x86/include/generated/asm/syscalls_64.h:17 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff8881077576c0 of 8 bytes by task 13164 on cpu 0: xas_find_chunk include/linux/xarray.h:1752 [inline] xas_find_marked+0x213/0x620 lib/xarray.c:1510 find_get_entry mm/filemap.c:2064 [inline] filemap_get_folios_tag+0xfa/0x510 mm/filemap.c:2332 mpage_prepare_extent_to_map+0x328/0xc50 fs/ext4/inode.c:2632 ext4_do_writepages+0x70f/0x2840 fs/ext4/inode.c:2871 ext4_writepages+0x18f/0x320 fs/ext4/inode.c:3019 do_writepages+0x1c6/0x310 mm/page-writeback.c:2554 filemap_writeback mm/filemap.c:387 [inline] filemap_fdatawrite_range mm/filemap.c:412 [inline] file_write_and_wait_range+0x178/0x2f0 mm/filemap.c:786 generic_buffers_fsync_noflush+0x45/0x130 fs/buffer.c:607 ext4_fsync_nojournal fs/ext4/fsync.c:88 [inline] ext4_sync_file+0x1aa/0x6b0 fs/ext4/fsync.c:147 vfs_fsync_range+0xc5/0xe0 fs/sync.c:186 generic_write_sync include/linux/fs.h:2640 [inline] ext4_buffered_write_iter+0x34f/0x3c0 fs/ext4/file.c:306 ext4_file_write_iter+0x380/0xfa0 fs/ext4/file.c:-1 new_sync_write fs/read_write.c:595 [inline] vfs_write+0x5a6/0x9f0 fs/read_write.c:688 ksys_write+0xdc/0x1a0 fs/read_write.c:740 __do_sys_write fs/read_write.c:751 [inline] __se_sys_write fs/read_write.c:748 [inline] __x64_sys_write+0x40/0x50 fs/read_write.c:748 x64_sys_call+0x27e1/0x3020 arch/x86/include/generated/asm/syscalls_64.h:2 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x003fffff00000000 -> 0x003ffff000000000 Reported by Kernel Concurrency 
Sanitizer on: CPU: 0 UID: 0 PID: 13164 Comm: syz.9.13498 Tainted: G W syzkaller #0 PREEMPT(full) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 ================================================================== Reply: The data race occurs between `xas_clear_mark()` and `xas_find_marked()` in `lib/xarray.c`.
### Analysis of the Race
1. **Writer Side (`xas_clear_mark`)**:
* The writer is executing `__folio_start_writeback()`, which calls `xas_lock_irqsave()` to acquire the XArray spinlock before calling `xas_clear_mark()`.
* `xas_clear_mark()` calls `node_clear_mark()`, which uses `___test_and_clear_bit()`. This is a non-atomic bitwise operation (on x86, it translates to a `btr` instruction without the `LOCK` prefix).
* The writer holds the spinlock, so it is the only writer to this XArray node at the time.
2. **Reader Side (`xas_find_marked`)**:
* The reader is executing `filemap_get_folios_tag()`, which uses `rcu_read_lock()` for lockless traversal of the page cache.
* `xas_find_marked()` calls `xas_find_chunk()`, which performs a plain load of the mark bitmap: `unsigned long data = *addr & (~0UL << offset);`.
* Since the reader does not hold the spinlock, it can race with the writer.
### Why the Race is Benign
* **Atomicity on x86_64**: On the x86_64 architecture (where this was reported), naturally aligned 64-bit loads and stores are atomic. Although `btr` is a read-modify-write instruction, the final write back to memory is a single 64-bit store. A reader on another CPU will see either the value before the bit was cleared or the value after. It will not see a "torn" value where only some bits are updated.
* **Robustness of XArray Logic**: The XArray is designed for RCU-protected lookups. If `xas_find_marked()` sees the old value (bit set), it returns the entry. If it sees the new value (bit cleared), it skips it. Both are valid states during a concurrent clear operation.
* **Higher-Level Verification**: In the case of page cache writeback (the context of this race), the caller `mpage_prepare_extent_to_map()` locks the folio returned by `filemap_get_folios_tag()` and re-checks its state (e.g., `folio_test_dirty()`, `folio_test_writeback()`). If the folio's state changed during the race, the higher-level logic handles it correctly (e.g., by skipping the folio if it's already under writeback).
* **No Risk of Corruption**: The race is on a bitmap used for marking/tagging, not on pointers or structural tree links. A race here might cause a reader to briefly see a stale mark or miss a mark being set, but it cannot lead to memory corruption or crashes.
In summary, this is a classic example of an intentional data race in a lockless data structure where the implementation relies on architecture-specific atomicity of aligned word accesses and robust higher-level logic to handle transient inconsistencies.
Final Verdict: Benign. |
| 5/2 | 2026/02/24 23:23 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 4999 output: 120 thoughts: 1401 Thoughts: **Analyzing the XArray Data Race** Okay, so the report indicates a data race within `lib/xarray.c` between `xas_clear_mark` and `xas_find_marked`. It's a read-write conflict on a mark within the XArray structure. Let me break this down. Task 13171 (CPU 1) is calling `xas_clear_mark` indirectly through `__folio_start_writeback`. This means a folio is likely being written back to disk, and the associated mark is being cleared. Meanwhile, Task 13164 (CPU 0) is calling `xas_find_marked` via `filemap_get_folios_tag`. This implies a read operation to find folios with a particular tag, which involves checking for set marks. The address involved is `0xffff8881077576c0`. Now, the crucial part: the value at that address changed from `0x003fffff00000000` to `0x003ffff000000000`. Let's dig into that bit pattern, which is how I start to see these things. Let me examine the binary representation of the two values. `0x003fffff00000000` is essentially `3fffff` shifted left by 32 bits and `0x003ffff000000000` is essentially `3ffff` shifted left by 36 bits. The first value is `0000 0000 0011 1111 1111 1111 1111 1111 0000 0000 0000 0000 0000 0000 0000 0000` and the second is `0000 0000 0011 1111 1111 1111 1111 0000 0000 0000 0000 0000 0000 0000 0000 0000`. So it looks like a bit was *cleared*. I can see that an `f` changed to a `0`. The second value has one more zero bit than the first. In any case, it's clear: `xas_clear_mark` is in the process of clearing a bit or bits, and `xas_find_marked` is attempting to read them. Now, let's look at the functions themselves. `xas_clear_mark` calls `node_clear_mark`, and `node_clear_mark` seems to be using `___test_and_clear_bit`. This is a crucial detail because it suggests an atomic operation, but the race is still happening. `xas_find_marked` is ultimately calling `xas_find_chunk`. 
And, as I suspected, XArray marks are stored as bitmaps within the `xa_node`. The race is specifically between clearing a bit and the read of it. `xas_find_marked` *should* be running under RCU or with some other synchronization primitive, but perhaps it isn't. The `filemap_get_folios_tag` call is likely using RCU, which means `xas_find_marked` should normally be safe. This is strange. The race might be a result of the memory barrier issues or issues with the order of operations. I'll need to double-check the implementations of `xas_find_marked` and `xas_clear_mark` very carefully, focusing on the synchronization mechanisms and the exact timing of the read and clear operations. There could be subtle issues with how the RCU is being used here or a missed barrier. The fact that an atomic operation is involved makes this all the more interesting; perhaps there's a problem with the atomic operation not being applied as intended. |
| 6/2 | 2026/02/24 23:23 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:lib/xarray.c IncludeLines:true Name:xas_clear_mark] Results: map[SourceCode: 904: /**
905:  * xas_clear_mark() - Clears the mark on this entry and its parents.
906:  * @xas: XArray operation state.
907:  * @mark: Mark number.
908:  *
909:  * Clears the specified mark on this entry, and walks back to the head
910:  * attempting to clear it on all the ancestor entries. Does nothing if
911:  * @xas has not been walked to an entry, or is in an error state.
912:  */
913: void xas_clear_mark(const struct xa_state *xas, xa_mark_t mark)
914: {
915: 	struct xa_node *node = xas->xa_node;
916: 	unsigned int offset = xas->xa_offset;
917:
918: 	if (xas_invalid(xas))
919: 		return;
920:
921: 	while (node) {
922: 		if (!node_clear_mark(node, offset, mark))
923: 			return;
924: 		if (node_any_mark(node, mark))
925: 			return;
926:
927: 		offset = node->offset;
928: 		node = xa_parent_locked(xas->xa, node);
929: 	}
930:
931: 	if (xa_marked(xas->xa, mark))
932: 		xa_mark_clear(xas->xa, mark);
933: }
SourceFile:lib/xarray.c] |
| 7/2 | 2026/02/24 23:23 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:lib/xarray.c IncludeLines:true Name:xas_find_marked] Results: map[SourceCode:1441: /**
1442:  * xas_find_marked() - Find the next marked entry in the XArray.
1443:  * @xas: XArray operation state.
1444:  * @max: Highest index to return.
1445:  * @mark: Mark number to search for.
1446:  *
1447:  * If the @xas has not yet been walked to an entry, return the marked entry
1448:  * which has an index >= xas.xa_index. If it has been walked, the entry
1449:  * currently being pointed at has been processed, and so we return the
1450:  * first marked entry with an index > xas.xa_index.
1451:  *
1452:  * If no marked entry is found and the array is smaller than @max, @xas is
1453:  * set to the bounds state and xas->xa_index is set to the smallest index
1454:  * not yet in the array. This allows @xas to be immediately passed to
1455:  * xas_store().
1456:  *
1457:  * If no entry is found before @max is reached, @xas is set to the restart
1458:  * state.
1459:  *
1460:  * Return: The entry, if found, otherwise %NULL.
1461:  */
1462: void *xas_find_marked(struct xa_state *xas, unsigned long max, xa_mark_t mark)
1463: {
1464: 	bool advance = true;
1465: 	unsigned int offset;
1466: 	void *entry;
1467:
1468: 	if (xas_error(xas))
1469: 		return NULL;
1470: 	if (xas->xa_index > max)
1471: 		goto max;
1472:
1473: 	if (!xas->xa_node) {
1474: 		xas->xa_index = 1;
1475: 		goto out;
1476: 	} else if (xas_top(xas->xa_node)) {
1477: 		advance = false;
1478: 		entry = xa_head(xas->xa);
1479: 		xas->xa_node = NULL;
1480: 		if (xas->xa_index > max_index(entry))
1481: 			goto out;
1482: 		if (!xa_is_node(entry)) {
1483: 			if (xa_marked(xas->xa, mark))
1484: 				return entry;
1485: 			xas->xa_index = 1;
1486: 			goto out;
1487: 		}
1488: 		xas->xa_node = xa_to_node(entry);
1489: 		xas->xa_offset = xas->xa_index >> xas->xa_node->shift;
1490: 	}
1491:
1492: 	while (xas->xa_index <= max) {
1493: 		if (unlikely(xas->xa_offset == XA_CHUNK_SIZE)) {
1494: 			xas->xa_offset = xas->xa_node->offset + 1;
1495: 			xas->xa_node = xa_parent(xas->xa, xas->xa_node);
1496: 			if (!xas->xa_node)
1497: 				break;
1498: 			advance = false;
1499: 			continue;
1500: 		}
1501:
1502: 		if (!advance) {
1503: 			entry = xa_entry(xas->xa, xas->xa_node, xas->xa_offset);
1504: 			if (xa_is_sibling(entry)) {
1505: 				xas->xa_offset = xa_to_sibling(entry);
1506: 				xas_move_index(xas, xas->xa_offset);
1507: 			}
1508: 		}
1509:
1510: 		offset = xas_find_chunk(xas, advance, mark);
1511: 		if (offset > xas->xa_offset) {
1512: 			advance = false;
1513: 			xas_move_index(xas, offset);
1514: 			/* Mind the wrap */
1515: 			if ((xas->xa_index - 1) >= max)
1516: 				goto max;
1517: 			xas->xa_offset = offset;
1518: 			if (offset == XA_CHUNK_SIZE)
1519: 				continue;
1520: 		}
1521:
1522: 		entry = xa_entry(xas->xa, xas->xa_node, xas->xa_offset);
1523: 		if (!entry && !(xa_track_free(xas->xa) && mark == XA_FREE_MARK))
1524: 			continue;
1525: 		if (xa_is_sibling(entry))
1526: 			continue;
1527: 		if (!xa_is_node(entry))
1528: 			return entry;
1529: 		xas->xa_node = xa_to_node(entry);
1530: 		xas_set_offset(xas);
1531: 	}
1532:
1533: out:
1534: 	if (xas->xa_index > max)
1535: 		goto max;
1536: 	return set_bounds(xas);
1537: max:
1538: 	xas->xa_node = XAS_RESTART;
1539: 	return NULL;
1540: }
SourceFile:lib/xarray.c] |
| 8/2 | 2026/02/24 23:23 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:lib/xarray.c IncludeLines:true Name:node_clear_mark] Results: map[SourceCode: 98: /* returns true if the bit was set */
99: static inline bool node_clear_mark(struct xa_node *node, unsigned int offset,
100: 		xa_mark_t mark)
101: {
102: 	return __test_and_clear_bit(offset, node_marks(node, mark));
103: }
SourceFile:lib/xarray.c] |
| 9/2 | 2026/02/24 23:23 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 9522 output: 38 thoughts: 7614 Thoughts: **Analyzing a Potential Data Race in the XArray Implementation** Okay, so I'm digging into this reported data race concerning the `xas_find_chunk` function, specifically within the `xarray.h` header file. The core of the problem seems to be the non-atomic nature of the `__test_and_clear_bit` function, which is used in `node_clear_mark`. My initial thought is that `xas_clear_mark`, where this function is called, usually operates under the protection of the XArray lock, while `xas_find_marked`, which reads from the same memory location, might be using an RCU lock, hence the potential race. I'm starting to investigate how these functions are used, and the context of their use in other parts of the kernel. I've looked at `__folio_start_writeback` and `filemap_get_folios_tag`. From looking at those two it's becoming clearer that `xas_clear_mark` is called within a locked region but `xas_find_marked` can be called from an RCU-protected region. It's the reader (using RCU), looking for an already cleared mark, that seems to be the problem, because if there's a writer (using the XArray lock), then we have the classic read-modify-write problem. I'm thinking that if `__test_and_clear_bit` isn't atomic, it could lead to the reader seeing an intermediate state of the memory location, thus incorrectly identifying an entry. Now, on most architectures, a non-atomic bit operation is a sequence of instructions. But the question is: can a reader "see" the write in process? On x86, isn't a single memory access atomic? I need to check Intel's Software Developer Manual. Ah, it says it's atomic to read/write a byte, word, or doubleword, _if_ it's aligned. Wait, a read-modify-write isn't atomic without a `lock` prefix. Okay, so `btr` is the x86 instruction used, and it's not atomic without a lock prefix, so there *is* a race. The reader could see the value before or after the bit is cleared. 
KCSAN is indeed right in reporting this as a potential problem. Now, is this race actually harmful? In RCU-protected lookups, is it acceptable for the reader to potentially see an older or newer state? As long as it doesn't cause a crash or an inconsistent state that the caller can't handle, it's generally fine. `filemap_get_folios_tag` has retry loops and checks to make sure the folio is still valid. So if the mark is cleared, it will still go through the writeback process. The concern with non-atomic bitops is often around multiple writers. In this case, there's only one writer holding the lock. The reader sees a potentially "torn" value. For the case of a naturally aligned 64-bit write, is a "torn" read possible? And in fact, KCSAN is complaining about `__instrument_read_write_bitop` and `___test_and_clear_bit`. If a bit is set to zero, is it fine if the reader catches the wrong value and needs to retry? Wait, if `xas_clear_mark` is clearing a mark, if the mark is already set when `xas_find_marked` sees it, `xas_find_marked` returns the marked entry. If it sees it as cleared, it skips it. That seems fine. However, XArray marks are also used for navigation in the tree, and the marks are the *only* marks used for tree traversal. Let's say `xas_find_marked` is traversing the tree. If the mark in the child is cleared after the parent check but before the child check, it won't find anything in that child. That could be problematic, but does `xas_find_marked` handle this correctly? Yes. If `xas_find_chunk` returns `XA_CHUNK_SIZE`, then `xas->xa_offset` is `XA_CHUNK_SIZE` and it goes up to the parent. The report highlights a specific value change. `0x003fffff00000000` changing to `0x003ffff000000000`. So it looks like multiple bits are being cleared. Let me check the number of bits again. Wait, I was misreading the hex. It's not a single bit, but multiple bits that are being cleared. Bits 32,33,34,35 are being cleared. Does `node_clear_ |
| 10/2 | 2026/02/24 23:24 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:lib/xarray.c IncludeLines:true Name:node_marks] Results: map[SourceCode: 80: static inline unsigned long *node_marks(struct xa_node *node, xa_mark_t mark)
81: {
82: 	return node->marks[(__force unsigned)mark];
83: }
SourceFile:lib/xarray.c] |
| 11/2 | 2026/02/24 23:24 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 18288 output: 23 thoughts: 176 Thoughts: **Analyzing a Potential Race Condition in XArray** Okay, so I'm looking at this `xas_find_chunk` function, and it seems like that dereference, `*addr`, is just a straightforward, plain load. That immediately makes me think, "race condition." The compiler *could* theoretically split that load, though it's pretty improbable with an `unsigned long`. However, KCSAN is *definitely* going to flag that as a problem since it's a plain load competing with a write. The "correct" solution, to satisfy KCSAN and to completely eliminate the risk of the compiler doing something sneaky with that load, is probably to just slap a `READ_ONCE` on it. That's the safe bet. But the real question is, is it *actually* going to be harmful? In the kernel, we often tolerate these sorts of races if they don't crash the system and the code can gracefully handle slightly out-of-date or inconsistent data. We're talking about XArray here, which is a *core* component, and performance is paramount. I need to balance correctness with the potential performance impact of using `READ_ONCE`. Alright, time to go back and take a fresh look at `xas_find_chunk` in `include/linux/xarray.h`. Let's see if we can really assess the risk here. |
| 12/2 | 2026/02/24 23:24 | tool | read-file |
Args:map[File:include/linux/xarray.h] Results: map[Contents:/* SPDX-License-Identifier: GPL-2.0+ */
#ifndef _LINUX_XARRAY_H
#define _LINUX_XARRAY_H
/*
* eXtensible Arrays
* Copyright (c) 2017 Microsoft Corporation
* Author: Matthew Wilcox <willy@infradead.org>
*
* See Documentation/core-api/xarray.rst for how to use the XArray.
*/
#include <linux/bitmap.h>
#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/kconfig.h>
#include <linux/limits.h>
#include <linux/lockdep.h>
#include <linux/rcupdate.h>
#include <linux/sched/mm.h>
#include <linux/spinlock.h>
#include <linux/types.h>
struct list_lru;
/*
* The bottom two bits of the entry determine how the XArray interprets
* the contents:
*
* 00: Pointer entry
* 10: Internal entry
* x1: Value entry or tagged pointer
*
* Attempting to store internal entries in the XArray is a bug.
*
* Most internal entries are pointers to the next node in the tree.
* The following internal entries have a special meaning:
*
* 0-62: Sibling entries
* 256: Retry entry
* 257: Zero entry
*
* Errors are also represented as internal entries, but use the negative
* space (-4094 to -2). They're never stored in the slots array; only
* returned by the normal API.
*/
#define BITS_PER_XA_VALUE (BITS_PER_LONG - 1)
/**
* xa_mk_value() - Create an XArray entry from an integer.
* @v: Value to store in XArray.
*
* Context: Any context.
* Return: An entry suitable for storing in the XArray.
*/
static inline void *xa_mk_value(unsigned long v)
{
WARN_ON((long)v < 0);
return (void *)((v << 1) | 1);
}
/**
* xa_to_value() - Get value stored in an XArray entry.
* @entry: XArray entry.
*
* Context: Any context.
* Return: The value stored in the XArray entry.
*/
static inline unsigned long xa_to_value(const void *entry)
{
return (unsigned long)entry >> 1;
}
/**
* xa_is_value() - Determine if an entry is a value.
* @entry: XArray entry.
*
* Context: Any context.
* Return: True if the entry is a value, false if it is a pointer.
*/
static inline bool xa_is_value(const void *entry)
{
return (unsigned long)entry & 1;
}
/**
* xa_tag_pointer() - Create an XArray entry for a tagged pointer.
* @p: Plain pointer.
* @tag: Tag value (0, 1 or 3).
*
* If the user of the XArray prefers, they can tag their pointers instead
* of storing value entries. Three tags are available (0, 1 and 3).
* These are distinct from the xa_mark_t as they are not replicated up
* through the array and cannot be searched for.
*
* Context: Any context.
* Return: An XArray entry.
*/
static inline void *xa_tag_pointer(void *p, unsigned long tag)
{
return (void *)((unsigned long)p | tag);
}
/**
* xa_untag_pointer() - Turn an XArray entry into a plain pointer.
* @entry: XArray entry.
*
* If you have stored a tagged pointer in the XArray, call this function
* to get the untagged version of the pointer.
*
* Context: Any context.
* Return: A pointer.
*/
static inline void *xa_untag_pointer(void *entry)
{
return (void *)((unsigned long)entry & ~3UL);
}
/**
* xa_pointer_tag() - Get the tag stored in an XArray entry.
* @entry: XArray entry.
*
* If you have stored a tagged pointer in the XArray, call this function
* to get the tag of that pointer.
*
* Context: Any context.
* Return: A tag.
*/
static inline unsigned int xa_pointer_tag(void *entry)
{
return (unsigned long)entry & 3UL;
}
/*
* xa_mk_internal() - Create an internal entry.
* @v: Value to turn into an internal entry.
*
* Internal entries are used for a number of purposes. Entries 0-255 are
* used for sibling entries (only 0-62 are used by the current code). 256
* is used for the retry entry. 257 is used for the reserved / zero entry.
* Negative internal entries are used to represent errnos. Node pointers
* are also tagged as internal entries in some situations.
*
* Context: Any context.
* Return: An XArray internal entry corresponding to this value.
*/
static inline void *xa_mk_internal(unsigned long v)
{
return (void *)((v << 2) | 2);
}
/*
* xa_to_internal() - Extract the value from an internal entry.
* @entry: XArray entry.
*
* Context: Any context.
* Return: The value which was stored in the internal entry.
*/
static inline unsigned long xa_to_internal(const void *entry)
{
return (unsigned long)entry >> 2;
}
/*
* xa_is_internal() - Is the entry an internal entry?
* @entry: XArray entry.
*
* Context: Any context.
* Return: %true if the entry is an internal entry.
*/
static inline bool xa_is_internal(const void *entry)
{
return ((unsigned long)entry & 3) == 2;
}
#define XA_ZERO_ENTRY xa_mk_internal(257)
/**
* xa_is_zero() - Is the entry a zero entry?
* @entry: Entry retrieved from the XArray
*
* The normal API will return NULL as the contents of a slot containing
* a zero entry. You can only see zero entries by using the advanced API.
*
* Return: %true if the entry is a zero entry.
*/
static inline bool xa_is_zero(const void *entry)
{
return unlikely(entry == XA_ZERO_ENTRY);
}
/**
* xa_is_err() - Report whether an XArray operation returned an error
* @entry: Result from calling an XArray function
*
* If an XArray operation cannot complete an operation, it will return
* a special value indicating an error. This function tells you
* whether an error occurred; xa_err() tells you which error occurred.
*
* Context: Any context.
* Return: %true if the entry indicates an error.
*/
static inline bool xa_is_err(const void *entry)
{
return unlikely(xa_is_internal(entry) &&
entry >= xa_mk_internal(-MAX_ERRNO));
}
/**
* xa_err() - Turn an XArray result into an errno.
* @entry: Result from calling an XArray function.
*
* If an XArray operation cannot complete an operation, it will return
* a special pointer value which encodes an errno. This function extracts
* the errno from the pointer value, or returns 0 if the pointer does not
* represent an errno.
*
* Context: Any context.
* Return: A negative errno or 0.
*/
static inline int xa_err(void *entry)
{
/* xa_to_internal() would not do sign extension. */
if (xa_is_err(entry))
return (long)entry >> 2;
return 0;
}
/**
* struct xa_limit - Represents a range of IDs.
* @min: The lowest ID to allocate (inclusive).
* @max: The maximum ID to allocate (inclusive).
*
* This structure is used either directly or via the XA_LIMIT() macro
* to communicate the range of IDs that are valid for allocation.
* Three common ranges are predefined for you:
* * xa_limit_32b - [0 - UINT_MAX]
* * xa_limit_31b - [0 - INT_MAX]
* * xa_limit_16b - [0 - USHRT_MAX]
*/
struct xa_limit {
u32 max;
u32 min;
};
#define XA_LIMIT(_min, _max) (struct xa_limit) { .min = _min, .max = _max }
#define xa_limit_32b XA_LIMIT(0, UINT_MAX)
#define xa_limit_31b XA_LIMIT(0, INT_MAX)
#define xa_limit_16b XA_LIMIT(0, USHRT_MAX)
typedef unsigned __bitwise xa_mark_t;
#define XA_MARK_0 ((__force xa_mark_t)0U)
#define XA_MARK_1 ((__force xa_mark_t)1U)
#define XA_MARK_2 ((__force xa_mark_t)2U)
#define XA_PRESENT ((__force xa_mark_t)8U)
#define XA_MARK_MAX XA_MARK_2
#define XA_FREE_MARK XA_MARK_0
enum xa_lock_type {
XA_LOCK_IRQ = 1,
XA_LOCK_BH = 2,
};
/*
* Values for xa_flags. The radix tree stores its GFP flags in the xa_flags,
* and we remain compatible with that.
*/
#define XA_FLAGS_LOCK_IRQ ((__force gfp_t)XA_LOCK_IRQ)
#define XA_FLAGS_LOCK_BH ((__force gfp_t)XA_LOCK_BH)
#define XA_FLAGS_TRACK_FREE ((__force gfp_t)4U)
#define XA_FLAGS_ZERO_BUSY ((__force gfp_t)8U)
#define XA_FLAGS_ALLOC_WRAPPED ((__force gfp_t)16U)
#define XA_FLAGS_ACCOUNT ((__force gfp_t)32U)
#define XA_FLAGS_MARK(mark) ((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
(__force unsigned)(mark)))
/* ALLOC is for a normal 0-based alloc. ALLOC1 is for an 1-based alloc */
#define XA_FLAGS_ALLOC (XA_FLAGS_TRACK_FREE | XA_FLAGS_MARK(XA_FREE_MARK))
#define XA_FLAGS_ALLOC1 (XA_FLAGS_TRACK_FREE | XA_FLAGS_ZERO_BUSY)
/**
* struct xarray - The anchor of the XArray.
* @xa_lock: Lock that protects the contents of the XArray.
*
* To use the xarray, define it statically or embed it in your data structure.
* It is a very small data structure, so it does not usually make sense to
* allocate it separately and keep a pointer to it in your data structure.
*
* You may use the xa_lock to protect your own data structures as well.
*/
/*
* If all of the entries in the array are NULL, @xa_head is a NULL pointer.
* If the only non-NULL entry in the array is at index 0, @xa_head is that
* entry. If any other entry in the array is non-NULL, @xa_head points
* to an @xa_node.
*/
struct xarray {
spinlock_t xa_lock;
/* private: The rest of the data structure is not to be used directly. */
gfp_t xa_flags;
void __rcu * xa_head;
};
#define XARRAY_INIT(name, flags) { \
.xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock), \
.xa_flags = flags, \
.xa_head = NULL, \
}
/**
* DEFINE_XARRAY_FLAGS() - Define an XArray with custom flags.
* @name: A string that names your XArray.
* @flags: XA_FLAG values.
*
* This is intended for file scope definitions of XArrays. It declares
* and initialises an empty XArray with the chosen name and flags. It is
* equivalent to calling xa_init_flags() on the array, but it does the
* initialisation at compiletime instead of runtime.
*/
#define DEFINE_XARRAY_FLAGS(name, flags) \
struct xarray name = XARRAY_INIT(name, flags)
/**
* DEFINE_XARRAY() - Define an XArray.
* @name: A string that names your XArray.
*
* This is intended for file scope definitions of XArrays. It declares
* and initialises an empty XArray with the chosen name. It is equivalent
* to calling xa_init() on the array, but it does the initialisation at
* compiletime instead of runtime.
*/
#define DEFINE_XARRAY(name) DEFINE_XARRAY_FLAGS(name, 0)
/**
* DEFINE_XARRAY_ALLOC() - Define an XArray which allocates IDs starting at 0.
* @name: A string that names your XArray.
*
* This is intended for file scope definitions of allocating XArrays.
* See also DEFINE_XARRAY().
*/
#define DEFINE_XARRAY_ALLOC(name) DEFINE_XARRAY_FLAGS(name, XA_FLAGS_ALLOC)
/**
* DEFINE_XARRAY_ALLOC1() - Define an XArray which allocates IDs starting at 1.
* @name: A string that names your XArray.
*
* This is intended for file scope definitions of allocating XArrays.
* See also DEFINE_XARRAY().
*/
#define DEFINE_XARRAY_ALLOC1(name) DEFINE_XARRAY_FLAGS(name, XA_FLAGS_ALLOC1)
void *xa_load(struct xarray *, unsigned long index);
void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
void *xa_erase(struct xarray *, unsigned long index);
void *xa_store_range(struct xarray *, unsigned long first, unsigned long last,
void *entry, gfp_t);
bool xa_get_mark(struct xarray *, unsigned long index, xa_mark_t);
void xa_set_mark(struct xarray *, unsigned long index, xa_mark_t);
void xa_clear_mark(struct xarray *, unsigned long index, xa_mark_t);
void *xa_find(struct xarray *xa, unsigned long *index,
unsigned long max, xa_mark_t) __attribute__((nonnull(2)));
void *xa_find_after(struct xarray *xa, unsigned long *index,
unsigned long max, xa_mark_t) __attribute__((nonnull(2)));
unsigned int xa_extract(struct xarray *, void **dst, unsigned long start,
unsigned long max, unsigned int n, xa_mark_t);
void xa_destroy(struct xarray *);
/**
* xa_init_flags() - Initialise an empty XArray with flags.
* @xa: XArray.
* @flags: XA_FLAG values.
*
* If you need to initialise an XArray with special flags (eg you need
* to take the lock from interrupt context), use this function instead
* of xa_init().
*
* Context: Any context.
*/
static inline void xa_init_flags(struct xarray *xa, gfp_t flags)
{
spin_lock_init(&xa->xa_lock);
xa->xa_flags = flags;
xa->xa_head = NULL;
}
/**
* xa_init() - Initialise an empty XArray.
* @xa: XArray.
*
* An empty XArray is full of NULL entries.
*
* Context: Any context.
*/
static inline void xa_init(struct xarray *xa)
{
xa_init_flags(xa, 0);
}
/**
* xa_empty() - Determine if an array has any present entries.
* @xa: XArray.
*
* Context: Any context.
* Return: %true if the array contains only NULL pointers.
*/
static inline bool xa_empty(const struct xarray *xa)
{
return xa->xa_head == NULL;
}
/**
* xa_marked() - Inquire whether any entry in this array has a mark set
* @xa: Array
* @mark: Mark value
*
* Context: Any context.
* Return: %true if any entry has this mark set.
*/
static inline bool xa_marked(const struct xarray *xa, xa_mark_t mark)
{
return xa->xa_flags & XA_FLAGS_MARK(mark);
}
/**
* xa_for_each_range() - Iterate over a portion of an XArray.
* @xa: XArray.
* @index: Index of @entry.
* @entry: Entry retrieved from array.
* @start: First index to retrieve from array.
* @last: Last index to retrieve from array.
*
* During the iteration, @entry will have the value of the entry stored
* in @xa at @index. You may modify @index during the iteration if you
* want to skip or reprocess indices. It is safe to modify the array
* during the iteration. At the end of the iteration, @entry will be set
* to NULL and @index will have a value less than or equal to max.
*
* xa_for_each_range() is O(n.log(n)) while xas_for_each() is O(n). You have
* to handle your own locking with xas_for_each(), and if you have to unlock
* after each iteration, it will also end up being O(n.log(n)).
* xa_for_each_range() will spin if it hits a retry entry; if you intend to
* see retry entries, you should use the xas_for_each() iterator instead.
* The xas_for_each() iterator will expand into more inline code than
* xa_for_each_range().
*
* Context: Any context. Takes and releases the RCU lock.
*/
#define xa_for_each_range(xa, index, entry, start, last) \
for (index = start, \
entry = xa_find(xa, &index, last, XA_PRESENT); \
entry; \
entry = xa_find_after(xa, &index, last, XA_PRESENT))
/**
* xa_for_each_start() - Iterate over a portion of an XArray.
* @xa: XArray.
* @index: Index of @entry.
* @entry: Entry retrieved from array.
* @start: First index to retrieve from array.
*
* During the iteration, @entry will have the value of the entry stored
* in @xa at @index. You may modify @index during the iteration if you
* want to skip or reprocess indices. It is safe to modify the array
* during the iteration. At the end of the iteration, @entry will be set
* to NULL and @index will have a value less than or equal to max.
*
* xa_for_each_start() is O(n.log(n)) while xas_for_each() is O(n). You have
* to handle your own locking with xas_for_each(), and if you have to unlock
* after each iteration, it will also end up being O(n.log(n)).
* xa_for_each_start() will spin if it hits a retry entry; if you intend to
* see retry entries, you should use the xas_for_each() iterator instead.
* The xas_for_each() iterator will expand into more inline code than
* xa_for_each_start().
*
* Context: Any context. Takes and releases the RCU lock.
*/
#define xa_for_each_start(xa, index, entry, start) \
xa_for_each_range(xa, index, entry, start, ULONG_MAX)
/**
* xa_for_each() - Iterate over present entries in an XArray.
* @xa: XArray.
* @index: Index of @entry.
* @entry: Entry retrieved from array.
*
* During the iteration, @entry will have the value of the entry stored
* in @xa at @index. You may modify @index during the iteration if you want
* to skip or reprocess indices. It is safe to modify the array during the
* iteration. At the end of the iteration, @entry will be set to NULL and
* @index will have a value less than or equal to max.
*
* xa_for_each() is O(n.log(n)) while xas_for_each() is O(n). You have
* to handle your own locking with xas_for_each(), and if you have to unlock
* after each iteration, it will also end up being O(n.log(n)). xa_for_each()
* will spin if it hits a retry entry; if you intend to see retry entries,
* you should use the xas_for_each() iterator instead. The xas_for_each()
* iterator will expand into more inline code than xa_for_each().
*
* Context: Any context. Takes and releases the RCU lock.
*/
#define xa_for_each(xa, index, entry) \
xa_for_each_start(xa, index, entry, 0)
/**
* xa_for_each_marked() - Iterate over marked entries in an XArray.
* @xa: XArray.
* @index: Index of @entry.
* @entry: Entry retrieved from array.
* @filter: Selection criterion.
*
* During the iteration, @entry will have the value of the entry stored
* in @xa at @index. The iteration will skip all entries in the array
* which do not match @filter. You may modify @index during the iteration
* if you want to skip or reprocess indices. It is safe to modify the array
* during the iteration. At the end of the iteration, @entry will be set to
* NULL and @index will have a value less than or equal to max.
*
* xa_for_each_marked() is O(n.log(n)) while xas_for_each_marked() is O(n).
* You have to handle your own locking with xas_for_each(), and if you have
* to unlock after each iteration, it will also end up being O(n.log(n)).
* xa_for_each_marked() will spin if it hits a retry entry; if you intend to
* see retry entries, you should use the xas_for_each_marked() iterator
* instead. The xas_for_each_marked() iterator will expand into more inline
* code than xa_for_each_marked().
*
* Context: Any context. Takes and releases the RCU lock.
*/
#define xa_for_each_marked(xa, index, entry, filter) \
for (index = 0, entry = xa_find(xa, &index, ULONG_MAX, filter); \
entry; entry = xa_find_after(xa, &index, ULONG_MAX, filter))
#define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
#define xa_lock(xa) spin_lock(&(xa)->xa_lock)
#define xa_unlock(xa) spin_unlock(&(xa)->xa_lock)
#define xa_lock_bh(xa) spin_lock_bh(&(xa)->xa_lock)
#define xa_unlock_bh(xa) spin_unlock_bh(&(xa)->xa_lock)
#define xa_lock_irq(xa) spin_lock_irq(&(xa)->xa_lock)
#define xa_unlock_irq(xa) spin_unlock_irq(&(xa)->xa_lock)
#define xa_lock_irqsave(xa, flags) \
spin_lock_irqsave(&(xa)->xa_lock, flags)
#define xa_unlock_irqrestore(xa, flags) \
spin_unlock_irqrestore(&(xa)->xa_lock, flags)
#define xa_lock_nested(xa, subclass) \
spin_lock_nested(&(xa)->xa_lock, subclass)
#define xa_lock_bh_nested(xa, subclass) \
spin_lock_bh_nested(&(xa)->xa_lock, subclass)
#define xa_lock_irq_nested(xa, subclass) \
spin_lock_irq_nested(&(xa)->xa_lock, subclass)
#define xa_lock_irqsave_nested(xa, flags, subclass) \
spin_lock_irqsave_nested(&(xa)->xa_lock, flags, subclass)
/*
* Versions of the normal API which require the caller to hold the
* xa_lock. If the GFP flags allow it, they will drop the lock to
* allocate memory, then reacquire it afterwards. These functions
* may also re-enable interrupts if the XArray flags indicate the
* locking should be interrupt safe.
*/
void *__xa_erase(struct xarray *, unsigned long index);
void *__xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
void *__xa_cmpxchg(struct xarray *, unsigned long index, void *old,
void *entry, gfp_t);
int __must_check __xa_insert(struct xarray *, unsigned long index,
void *entry, gfp_t);
int __must_check __xa_alloc(struct xarray *, u32 *id, void *entry,
struct xa_limit, gfp_t);
int __must_check __xa_alloc_cyclic(struct xarray *, u32 *id, void *entry,
struct xa_limit, u32 *next, gfp_t);
void __xa_set_mark(struct xarray *, unsigned long index, xa_mark_t);
void __xa_clear_mark(struct xarray *, unsigned long index, xa_mark_t);
/**
* xa_store_bh() - Store this entry in the XArray.
* @xa: XArray.
* @index: Index into array.
* @entry: New entry.
* @gfp: Memory allocation flags.
*
* This function is like calling xa_store() except it disables softirqs
* while holding the array lock.
*
* Context: Any context. Takes and releases the xa_lock while
* disabling softirqs.
* Return: The old entry at this index or xa_err() if an error happened.
*/
static inline void *xa_store_bh(struct xarray *xa, unsigned long index,
void *entry, gfp_t gfp)
{
void *curr;
might_alloc(gfp);
xa_lock_bh(xa);
curr = __xa_store(xa, index, entry, gfp);
xa_unlock_bh(xa);
return curr;
}
/**
* xa_store_irq() - Store this entry in the XArray.
* @xa: XArray.
* @index: Index into array.
* @entry: New entry.
* @gfp: Memory allocation flags.
*
* This function is like calling xa_store() except it disables interrupts
* while holding the array lock.
*
* Context: Process context. Takes and releases the xa_lock while
* disabling interrupts.
* Return: The old entry at this index or xa_err() if an error happened.
*/
static inline void *xa_store_irq(struct xarray *xa, unsigned long index,
void *entry, gfp_t gfp)
{
void *curr;
might_alloc(gfp);
xa_lock_irq(xa);
curr = __xa_store(xa, index, entry, gfp);
xa_unlock_irq(xa);
return curr;
}
/**
* xa_erase_bh() - Erase this entry from the XArray.
* @xa: XArray.
* @index: Index of entry.
*
* After this function returns, loading from @index will return %NULL.
* If the index is part of a multi-index entry, all indices will be erased
* and none of the entries will be part of a multi-index entry.
*
* Context: Any context. Takes and releases the xa_lock while
* disabling softirqs.
* Return: The entry which used to be at this index.
*/
static inline void *xa_erase_bh(struct xarray *xa, unsigned long index)
{
void *entry;
xa_lock_bh(xa);
entry = __xa_erase(xa, index);
xa_unlock_bh(xa);
return entry;
}
/**
* xa_erase_irq() - Erase this entry from the XArray.
* @xa: XArray.
* @index: Index of entry.
*
* After this function returns, loading from @index will return %NULL.
* If the index is part of a multi-index entry, all indices will be erased
* and none of the entries will be part of a multi-index entry.
*
* Context: Process context. Takes and releases the xa_lock while
* disabling interrupts.
* Return: The entry which used to be at this index.
*/
static inline void *xa_erase_irq(struct xarray *xa, unsigned long index)
{
void *entry;
xa_lock_irq(xa);
entry = __xa_erase(xa, index);
xa_unlock_irq(xa);
return entry;
}
/**
* xa_cmpxchg() - Conditionally replace an entry in the XArray.
* @xa: XArray.
* @index: Index into array.
* @old: Old value to test against.
* @entry: New value to place in array.
* @gfp: Memory allocation flags.
*
* If the entry at @index is the same as @old, replace it with @entry.
* If the return value is equal to @old, then the exchange was successful.
*
* Context: Any context. Takes and releases the xa_lock. May sleep
* if the @gfp flags permit.
* Return: The old value at this index or xa_err() if an error happened.
*/
static inline void *xa_cmpxchg(struct xarray *xa, unsigned long index,
void *old, void *entry, gfp_t gfp)
{
void *curr;
might_alloc(gfp);
xa_lock(xa);
curr = __xa_cmpxchg(xa, index, old, entry, gfp);
xa_unlock(xa);
return curr;
}
/**
* xa_cmpxchg_bh() - Conditionally replace an entry in the XArray.
* @xa: XArray.
* @index: Index into array.
* @old: Old value to test against.
* @entry: New value to place in array.
* @gfp: Memory allocation flags.
*
* This function is like calling xa_cmpxchg() except it disables softirqs
* while holding the array lock.
*
* Context: Any context. Takes and releases the xa_lock while
* disabling softirqs. May sleep if the @gfp flags permit.
* Return: The old value at this index or xa_err() if an error happened.
*/
static inline void *xa_cmpxchg_bh(struct xarray *xa, unsigned long index,
void *old, void *entry, gfp_t gfp)
{
void *curr;
might_alloc(gfp);
xa_lock_bh(xa);
curr = __xa_cmpxchg(xa, index, old, entry, gfp);
xa_unlock_bh(xa);
return curr;
}
/**
* xa_cmpxchg_irq() - Conditionally replace an entry in the XArray.
* @xa: XArray.
* @index: Index into array.
* @old: Old value to test against.
* @entry: New value to place in array.
* @gfp: Memory allocation flags.
*
* This function is like calling xa_cmpxchg() except it disables interrupts
* while holding the array lock.
*
* Context: Process context. Takes and releases the xa_lock while
* disabling interrupts. May sleep if the @gfp flags permit.
* Return: The old value at this index or xa_err() if an error happened.
*/
static inline void *xa_cmpxchg_irq(struct xarray *xa, unsigned long index,
void *old, void *entry, gfp_t gfp)
{
void *curr;
might_alloc(gfp);
xa_lock_irq(xa);
curr = __xa_cmpxchg(xa, index, old, entry, gfp);
xa_unlock_irq(xa);
return curr;
}
/**
* xa_insert() - Store this entry in the XArray unless another entry is
* already present.
* @xa: XArray.
* @index: Index into array.
* @entry: New entry.
* @gfp: Memory allocation flags.
*
* Inserting a NULL entry will store a reserved entry (like xa_reserve())
* if no entry is present. Inserting will fail if a reserved entry is
* present, even though loading from this index will return NULL.
*
* Context: Any context. Takes and releases the xa_lock. May sleep if
* the @gfp flags permit.
* Return: 0 if the store succeeded. -EBUSY if another entry was present.
* -ENOMEM if memory could not be allocated.
*/
static inline int __must_check xa_insert(struct xarray *xa,
unsigned long index, void *entry, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock(xa);
err = __xa_insert(xa, index, entry, gfp);
xa_unlock(xa);
return err;
}
/**
* xa_insert_bh() - Store this entry in the XArray unless another entry is
* already present.
* @xa: XArray.
* @index: Index into array.
* @entry: New entry.
* @gfp: Memory allocation flags.
*
* Inserting a NULL entry will store a reserved entry (like xa_reserve())
* if no entry is present. Inserting will fail if a reserved entry is
* present, even though loading from this index will return NULL.
*
* Context: Any context. Takes and releases the xa_lock while
* disabling softirqs. May sleep if the @gfp flags permit.
* Return: 0 if the store succeeded. -EBUSY if another entry was present.
* -ENOMEM if memory could not be allocated.
*/
static inline int __must_check xa_insert_bh(struct xarray *xa,
unsigned long index, void *entry, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock_bh(xa);
err = __xa_insert(xa, index, entry, gfp);
xa_unlock_bh(xa);
return err;
}
/**
* xa_insert_irq() - Store this entry in the XArray unless another entry is
* already present.
* @xa: XArray.
* @index: Index into array.
* @entry: New entry.
* @gfp: Memory allocation flags.
*
* Inserting a NULL entry will store a reserved entry (like xa_reserve())
* if no entry is present. Inserting will fail if a reserved entry is
* present, even though loading from this index will return NULL.
*
* Context: Process context. Takes and releases the xa_lock while
* disabling interrupts. May sleep if the @gfp flags permit.
* Return: 0 if the store succeeded. -EBUSY if another entry was present.
* -ENOMEM if memory could not be allocated.
*/
static inline int __must_check xa_insert_irq(struct xarray *xa,
unsigned long index, void *entry, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock_irq(xa);
err = __xa_insert(xa, index, entry, gfp);
xa_unlock_irq(xa);
return err;
}
/**
* xa_alloc() - Find somewhere to store this entry in the XArray.
* @xa: XArray.
* @id: Pointer to ID.
* @entry: New entry.
* @limit: Range of ID to allocate.
* @gfp: Memory allocation flags.
*
* Finds an empty entry in @xa between @limit.min and @limit.max,
* stores the index into the @id pointer, then stores the entry at
* that index. A concurrent lookup will not see an uninitialised @id.
*
* Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set
* in xa_init_flags().
*
* Context: Any context. Takes and releases the xa_lock. May sleep if
* the @gfp flags permit.
* Return: 0 on success, -ENOMEM if memory could not be allocated or
* -EBUSY if there are no free entries in @limit.
*/
static inline __must_check int xa_alloc(struct xarray *xa, u32 *id,
void *entry, struct xa_limit limit, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock(xa);
err = __xa_alloc(xa, id, entry, limit, gfp);
xa_unlock(xa);
return err;
}
/**
* xa_alloc_bh() - Find somewhere to store this entry in the XArray.
* @xa: XArray.
* @id: Pointer to ID.
* @entry: New entry.
* @limit: Range of ID to allocate.
* @gfp: Memory allocation flags.
*
* Finds an empty entry in @xa between @limit.min and @limit.max,
* stores the index into the @id pointer, then stores the entry at
* that index. A concurrent lookup will not see an uninitialised @id.
*
* Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set
* in xa_init_flags().
*
* Context: Any context. Takes and releases the xa_lock while
* disabling softirqs. May sleep if the @gfp flags permit.
* Return: 0 on success, -ENOMEM if memory could not be allocated or
* -EBUSY if there are no free entries in @limit.
*/
static inline int __must_check xa_alloc_bh(struct xarray *xa, u32 *id,
void *entry, struct xa_limit limit, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock_bh(xa);
err = __xa_alloc(xa, id, entry, limit, gfp);
xa_unlock_bh(xa);
return err;
}
/**
* xa_alloc_irq() - Find somewhere to store this entry in the XArray.
* @xa: XArray.
* @id: Pointer to ID.
* @entry: New entry.
* @limit: Range of ID to allocate.
* @gfp: Memory allocation flags.
*
* Finds an empty entry in @xa between @limit.min and @limit.max,
* stores the index into the @id pointer, then stores the entry at
* that index. A concurrent lookup will not see an uninitialised @id.
*
* Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set
* in xa_init_flags().
*
* Context: Process context. Takes and releases the xa_lock while
* disabling interrupts. May sleep if the @gfp flags permit.
* Return: 0 on success, -ENOMEM if memory could not be allocated or
* -EBUSY if there are no free entries in @limit.
*/
static inline int __must_check xa_alloc_irq(struct xarray *xa, u32 *id,
void *entry, struct xa_limit limit, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock_irq(xa);
err = __xa_alloc(xa, id, entry, limit, gfp);
xa_unlock_irq(xa);
return err;
}
/**
* xa_alloc_cyclic() - Find somewhere to store this entry in the XArray.
* @xa: XArray.
* @id: Pointer to ID.
* @entry: New entry.
* @limit: Range of allocated ID.
* @next: Pointer to next ID to allocate.
* @gfp: Memory allocation flags.
*
* Finds an empty entry in @xa between @limit.min and @limit.max,
* stores the index into the @id pointer, then stores the entry at
* that index. A concurrent lookup will not see an uninitialised @id.
* The search for an empty entry will start at @next and will wrap
* around if necessary.
*
* Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set
* in xa_init_flags().
*
* Note that callers interested in whether wrapping has occurred should
* use __xa_alloc_cyclic() instead.
*
* Context: Any context. Takes and releases the xa_lock. May sleep if
* the @gfp flags permit.
* Return: 0 if the allocation succeeded, -ENOMEM if memory could not be
* allocated or -EBUSY if there are no free entries in @limit.
*/
static inline int xa_alloc_cyclic(struct xarray *xa, u32 *id, void *entry,
struct xa_limit limit, u32 *next, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock(xa);
err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp);
xa_unlock(xa);
return err < 0 ? err : 0;
}
/**
* xa_alloc_cyclic_bh() - Find somewhere to store this entry in the XArray.
* @xa: XArray.
* @id: Pointer to ID.
* @entry: New entry.
* @limit: Range of allocated ID.
* @next: Pointer to next ID to allocate.
* @gfp: Memory allocation flags.
*
* Finds an empty entry in @xa between @limit.min and @limit.max,
* stores the index into the @id pointer, then stores the entry at
* that index. A concurrent lookup will not see an uninitialised @id.
* The search for an empty entry will start at @next and will wrap
* around if necessary.
*
* Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set
* in xa_init_flags().
*
* Note that callers interested in whether wrapping has occurred should
* use __xa_alloc_cyclic() instead.
*
* Context: Any context. Takes and releases the xa_lock while
* disabling softirqs. May sleep if the @gfp flags permit.
* Return: 0 if the allocation succeeded, -ENOMEM if memory could not be
* allocated or -EBUSY if there are no free entries in @limit.
*/
static inline int xa_alloc_cyclic_bh(struct xarray *xa, u32 *id, void *entry,
struct xa_limit limit, u32 *next, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock_bh(xa);
err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp);
xa_unlock_bh(xa);
return err < 0 ? err : 0;
}
/**
* xa_alloc_cyclic_irq() - Find somewhere to store this entry in the XArray.
* @xa: XArray.
* @id: Pointer to ID.
* @entry: New entry.
* @limit: Range of allocated ID.
* @next: Pointer to next ID to allocate.
* @gfp: Memory allocation flags.
*
* Finds an empty entry in @xa between @limit.min and @limit.max,
* stores the index into the @id pointer, then stores the entry at
* that index. A concurrent lookup will not see an uninitialised @id.
* The search for an empty entry will start at @next and will wrap
* around if necessary.
*
* Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set
* in xa_init_flags().
*
* Note that callers interested in whether wrapping has occurred should
* use __xa_alloc_cyclic() instead.
*
* Context: Process context. Takes and releases the xa_lock while
* disabling interrupts. May sleep if the @gfp flags permit.
* Return: 0 if the allocation succeeded, -ENOMEM if memory could not be
* allocated or -EBUSY if there are no free entries in @limit.
*/
static inline int xa_alloc_cyclic_irq(struct xarray *xa, u32 *id, void *entry,
struct xa_limit limit, u32 *next, gfp_t gfp)
{
int err;
might_alloc(gfp);
xa_lock_irq(xa);
err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp);
xa_unlock_irq(xa);
return err < 0 ? err : 0;
}
/**
* xa_reserve() - Reserve this index in the XArray.
* @xa: XArray.
* @index: Index into array.
* @gfp: Memory allocation flags.
*
* Ensures there is somewhere to store an entry at @index in the array.
* If there is already something stored at @index, this function does
* nothing. If there was nothing there, the entry is marked as reserved.
* Loading from a reserved entry returns a %NULL pointer.
*
* If you do not use the entry that you have reserved, call xa_release()
* or xa_erase() to free any unnecessary memory.
*
* Context: Any context. Takes and releases the xa_lock.
* May sleep if the @gfp flags permit.
* Return: 0 if the reservation succeeded or -ENOMEM if it failed.
*/
static inline __must_check
int xa_reserve(struct xarray *xa, unsigned long index, gfp_t gfp)
{
return xa_err(xa_cmpxchg(xa, index, NULL, XA_ZERO_ENTRY, gfp));
}
/**
* xa_reserve_bh() - Reserve this index in the XArray.
* @xa: XArray.
* @index: Index into array.
* @gfp: Memory allocation flags.
*
* A softirq-disabling version of xa_reserve().
*
* Context: Any context. Takes and releases the xa_lock while
* disabling softirqs.
* Return: 0 if the reservation succeeded or -ENOMEM if it failed.
*/
static inline __must_check
int xa_reserve_bh(struct xarray *xa, unsigned long index, gfp_t gfp)
{
return xa_err(xa_cmpxchg_bh(xa, index, NULL, XA_ZERO_ENTRY, gfp));
}
/**
* xa_reserve_irq() - Reserve this index in the XArray.
* @xa: XArray.
* @index: Index into array.
* @gfp: Memory allocation flags.
*
* An interrupt-disabling version of xa_reserve().
*
* Context: Process context. Takes and releases the xa_lock while
* disabling interrupts.
* Return: 0 if the reservation succeeded or -ENOMEM if it failed.
*/
static inline __must_check
int xa_reserve_irq(struct xarray *xa, unsigned long index, gfp_t gfp)
{
return xa_err(xa_cmpxchg_irq(xa, index, NULL, XA_ZERO_ENTRY, gfp));
}
/**
* xa_release() - Release a reserved entry.
* @xa: XArray.
* @index: Index of entry.
*
* After calling xa_reserve(), you can call this function to release the
* reservation. If the entry at @index has been stored to, this function
* will do nothing.
*/
static inline void xa_release(struct xarray *xa, unsigned long index)
{
xa_cmpxchg(xa, index, XA_ZERO_ENTRY, NULL, 0);
}
/* Everything below here is the Advanced API. Proceed with caution. */
/*
* The xarray is constructed out of a set of 'chunks' of pointers. Choosing
* the best chunk size requires some tradeoffs. A power of two recommends
* itself so that we can walk the tree based purely on shifts and masks.
* Generally, the larger the better; as the number of slots per level of the
* tree increases, the less tall the tree needs to be. But that needs to be
* balanced against the memory consumption of each node. On a 64-bit system,
* xa_node is currently 576 bytes, and we get 7 of them per 4kB page. If we
* doubled the number of slots per node, we'd get only 3 nodes per 4kB page.
*/
#ifndef XA_CHUNK_SHIFT
#define XA_CHUNK_SHIFT (IS_ENABLED(CONFIG_BASE_SMALL) ? 4 : 6)
#endif
#define XA_CHUNK_SIZE (1UL << XA_CHUNK_SHIFT)
#define XA_CHUNK_MASK (XA_CHUNK_SIZE - 1)
#define XA_MAX_MARKS 3
#define XA_MARK_LONGS BITS_TO_LONGS(XA_CHUNK_SIZE)
/*
* @count is the count of every non-NULL element in the ->slots array
* whether that is a value entry, a retry entry, a user pointer,
* a sibling entry or a pointer to the next level of the tree.
* @nr_values is the count of every element in ->slots which is
* either a value entry or a sibling of a value entry.
*/
struct xa_node {
unsigned char shift; /* Bits remaining in each slot */
unsigned char offset; /* Slot offset in parent */
unsigned char count; /* Total entry count */
unsigned char nr_values; /* Value entry count */
struct xa_node __rcu *parent; /* NULL at top of tree */
struct xarray *array; /* The array we belong to */
union {
struct list_head private_list; /* For tree user */
struct rcu_head rcu_head; /* Used when freeing node */
};
void __rcu *slots[XA_CHUNK_SIZE];
union {
unsigned long tags[XA_MAX_MARKS][XA_MARK_LONGS];
unsigned long marks[XA_MAX_MARKS][XA_MARK_LONGS];
};
};
void xa_dump(const struct xarray *);
void xa_dump_node(const struct xa_node *);
#ifdef XA_DEBUG
#define XA_BUG_ON(xa, x) do { \
if (x) { \
xa_dump(xa); \
BUG(); \
} \
} while (0)
#define XA_NODE_BUG_ON(node, x) do { \
if (x) { \
if (node) xa_dump_node(node); \
BUG(); \
} \
} while (0)
#else
#define XA_BUG_ON(xa, x) do { } while (0)
#define XA_NODE_BUG_ON(node, x) do { } while (0)
#endif
/* Private */
static inline void *xa_head(const struct xarray *xa)
{
return rcu_dereference_check(xa->xa_head,
lockdep_is_held(&xa->xa_lock));
}
/* Private */
static inline void *xa_head_locked(const struct xarray *xa)
{
return rcu_dereference_protected(xa->xa_head,
lockdep_is_held(&xa->xa_lock));
}
/* Private */
static inline void *xa_entry(const struct xarray *xa,
const struct xa_node *node, unsigned int offset)
{
XA_NODE_BUG_ON(node, offset >= XA_CHUNK_SIZE);
return rcu_dereference_check(node->slots[offset],
lockdep_is_held(&xa->xa_lock));
}
/* Private */
static inline void *xa_entry_locked(const struct xarray *xa,
const struct xa_node *node, unsigned int offset)
{
XA_NODE_BUG_ON(node, offset >= XA_CHUNK_SIZE);
return rcu_dereference_protected(node->slots[offset],
lockdep_is_held(&xa->xa_lock));
}
/* Private */
static inline struct xa_node *xa_parent(const struct xarray *xa,
const struct xa_node *node)
{
return rcu_dereference_check(node->parent,
lockdep_is_held(&xa->xa_lock));
}
/* Private */
static inline struct xa_node *xa_parent_locked(const struct xarray *xa,
const struct xa_node *node)
{
return rcu_dereference_protected(node->parent,
lockdep_is_held(&xa->xa_lock));
}
/* Private */
static inline void *xa_mk_node(const struct xa_node *node)
{
return (void *)((unsigned long)node | 2);
}
/* Private */
static inline struct xa_node *xa_to_node(const void *entry)
{
return (struct xa_node *)((unsigned long)entry - 2);
}
/* Private */
static inline bool xa_is_node(const void *entry)
{
return xa_is_internal(entry) && (unsigned long)entry > 4096;
}
/* Private */
static inline void *xa_mk_sibling(unsigned int offset)
{
return xa_mk_internal(offset);
}
/* Private */
static inline unsigned long xa_to_sibling(const void *entry)
{
return xa_to_internal(entry);
}
/**
* xa_is_sibling() - Is the entry a sibling entry?
* @entry: Entry retrieved from the XArray
*
* Return: %true if the entry is a sibling entry.
*/
static inline bool xa_is_sibling(const void *entry)
{
return IS_ENABLED(CONFIG_XARRAY_MULTI) && xa_is_internal(entry) &&
(entry < xa_mk_sibling(XA_CHUNK_SIZE - 1));
}
#define XA_RETRY_ENTRY xa_mk_internal(256)
/**
* xa_is_retry() - Is the entry a retry entry?
* @entry: Entry retrieved from the XArray
*
* Return: %true if the entry is a retry entry.
*/
static inline bool xa_is_retry(const void *entry)
{
return unlikely(entry == XA_RETRY_ENTRY);
}
/**
* xa_is_advanced() - Is the entry only permitted for the advanced API?
* @entry: Entry to be stored in the XArray.
*
* Return: %true if the entry cannot be stored by the normal API.
*/
static inline bool xa_is_advanced(const void *entry)
{
return xa_is_internal(entry) && (entry <= XA_RETRY_ENTRY);
}
/**
* typedef xa_update_node_t - A callback function from the XArray.
* @node: The node which is being processed
*
* This function is called every time the XArray updates the count of
* present and value entries in a node. It allows advanced users to
* maintain the private_list in the node.
*
* Context: The xa_lock is held and interrupts may be disabled.
* Implementations should not drop the xa_lock, nor re-enable
* interrupts.
*/
typedef void (*xa_update_node_t)(struct xa_node *node);
void xa_delete_node(struct xa_node *, xa_update_node_t);
/*
* The xa_state is opaque to its users. It contains various different pieces
* of state involved in the current operation on the XArray. It should be
* declared on the stack and passed between the various internal routines.
* The various elements in it should not be accessed directly, but only
* through the provided accessor functions. The below documentation is for
* the benefit of those working on the code, not for users of the XArray.
*
* @xa_node usually points to the xa_node containing the slot we're operating
* on (and @xa_offset is the offset in the slots array). If there is a
* single entry in the array at index 0, there are no allocated xa_nodes to
* point to, and so we store %NULL in @xa_node. @xa_node is set to
* the value %XAS_RESTART if the xa_state is not walked to the correct
* position in the tree of nodes for this operation. If an error occurs
* during an operation, it is set to an %XAS_ERROR value. If we run off the
* end of the allocated nodes, it is set to %XAS_BOUNDS.
*/
struct xa_state {
struct xarray *xa;
unsigned long xa_index;
unsigned char xa_shift;
unsigned char xa_sibs;
unsigned char xa_offset;
unsigned char xa_pad; /* Helps gcc generate better code */
struct xa_node *xa_node;
struct xa_node *xa_alloc;
xa_update_node_t xa_update;
struct list_lru *xa_lru;
};
/*
* We encode errnos in the xas->xa_node. If an error has happened, we need to
* drop the lock to fix it, and once we've done so the xa_state is invalid.
*/
#define XA_ERROR(errno) ((struct xa_node *)(((unsigned long)errno << 2) | 2UL))
#define XAS_BOUNDS ((struct xa_node *)1UL)
#define XAS_RESTART ((struct xa_node *)3UL)
#define __XA_STATE(array, index, shift, sibs) { \
.xa = array, \
.xa_index = index, \
.xa_shift = shift, \
.xa_sibs = sibs, \
.xa_offset = 0, \
.xa_pad = 0, \
.xa_node = XAS_RESTART, \
.xa_alloc = NULL, \
.xa_update = NULL, \
.xa_lru = NULL, \
}
/**
* XA_STATE() - Declare an XArray operation state.
* @name: Name of this operation state (usually xas).
* @array: Array to operate on.
* @index: Initial index of interest.
*
* Declare and initialise an xa_state on the stack.
*/
#define XA_STATE(name, array, index) \
struct xa_state name = __XA_STATE(array, index, 0, 0)
/**
* XA_STATE_ORDER() - Declare an XArray operation state.
* @name: Name of this operation state (usually xas).
* @array: Array to operate on.
* @index: Initial index of interest.
* @order: Order of entry.
*
* Declare and initialise an xa_state on the stack. This variant of
* XA_STATE() allows you to specify the 'order' of the element you
* want to operate on.`
*/
#define XA_STATE_ORDER(name, array, index, order) \
struct xa_state name = __XA_STATE(array, \
(index >> order) << order, \
order - (order % XA_CHUNK_SHIFT), \
(1U << (order % XA_CHUNK_SHIFT)) - 1)
#define xas_marked(xas, mark) xa_marked((xas)->xa, (mark))
#define xas_trylock(xas) xa_trylock((xas)->xa)
#define xas_lock(xas) xa_lock((xas)->xa)
#define xas_unlock(xas) xa_unlock((xas)->xa)
#define xas_lock_bh(xas) xa_lock_bh((xas)->xa)
#define xas_unlock_bh(xas) xa_unlock_bh((xas)->xa)
#define xas_lock_irq(xas) xa_lock_irq((xas)->xa)
#define xas_unlock_irq(xas) xa_unlock_irq((xas)->xa)
#define xas_lock_irqsave(xas, flags) \
xa_lock_irqsave((xas)->xa, flags)
#define xas_unlock_irqrestore(xas, flags) \
xa_unlock_irqrestore((xas)->xa, flags)
/**
* xas_error() - Return an errno stored in the xa_state.
* @xas: XArray operation state.
*
* Return: 0 if no error has been noted. A negative errno if one has.
*/
static inline int xas_error(const struct xa_state *xas)
{
return xa_err(xas->xa_node);
}
/**
* xas_set_err() - Note an error in the xa_state.
* @xas: XArray operation state.
* @err: Negative error number.
*
* Only call this function with a negative @err; zero or positive errors
* will probably not behave the way you think they should. If you want
* to clear the error from an xa_state, use xas_reset().
*/
static inline void xas_set_err(struct xa_state *xas, long err)
{
xas->xa_node = XA_ERROR(err);
}
/**
* xas_invalid() - Is the xas in a retry or error state?
* @xas: XArray operation state.
*
* Return: %true if the xas cannot be used for operations.
*/
static inline bool xas_invalid(const struct xa_state *xas)
{
return (unsigned long)xas->xa_node & 3;
}
/**
* xas_valid() - Is the xas a valid cursor into the array?
* @xas: XArray operation state.
*
* Return: %true if the xas can be used for operations.
*/
static inline bool xas_valid(const struct xa_state *xas)
{
return !xas_invalid(xas);
}
/**
* xas_is_node() - Does the xas point to a node?
* @xas: XArray operation state.
*
* Return: %true if the xas currently references a node.
*/
static inline bool xas_is_node(const struct xa_state *xas)
{
return xas_valid(xas) && xas->xa_node;
}
/* True if the pointer is something other than a node */
static inline bool xas_not_node(struct xa_node *node)
{
return ((unsigned long)node & 3) || !node;
}
/* True if the node represents RESTART or an error */
static inline bool xas_frozen(struct xa_node *node)
{
return (unsigned long)node & 2;
}
/* True if the node represents head-of-tree, RESTART or BOUNDS */
static inline bool xas_top(struct xa_node *node)
{
return node <= XAS_RESTART;
}
/**
* xas_reset() - Reset an XArray operation state.
* @xas: XArray operation state.
*
* Resets the error or walk state of the @xas so future walks of the
* array will start from the root. Use this if you have dropped the
* xarray lock and want to reuse the xa_state.
*
* Context: Any context.
*/
static inline void xas_reset(struct xa_state *xas)
{
xas->xa_node = XAS_RESTART;
}
/**
* xas_retry() - Retry the operation if appropriate.
* @xas: XArray operation state.
* @entry: Entry from xarray.
*
* The advanced functions may sometimes return an internal entry, such as
* a retry entry or a zero entry. This function sets up the @xas to restart
* the walk from the head of the array if needed.
*
* Context: Any context.
* Return: true if the operation needs to be retried.
*/
static inline bool xas_retry(struct xa_state *xas, const void *entry)
{
if (xa_is_zero(entry))
return true;
if (!xa_is_retry(entry))
return false;
xas_reset(xas);
return true;
}
void *xas_load(struct xa_state *);
void *xas_store(struct xa_state *, void *entry);
void *xas_find(struct xa_state *, unsigned long max);
void *xas_find_conflict(struct xa_state *);
bool xas_get_mark(const struct xa_state *, xa_mark_t);
void xas_set_mark(const struct xa_state *, xa_mark_t);
void xas_clear_mark(const struct xa_state *, xa_mark_t);
void *xas_find_marked(struct xa_state *, unsigned long max, xa_mark_t);
void xas_init_marks(const struct xa_state *);
bool xas_nomem(struct xa_state *, gfp_t);
void xas_destroy(struct xa_state *);
void xas_pause(struct xa_state *);
void xas_create_range(struct xa_state *);
#ifdef CONFIG_XARRAY_MULTI
int xa_get_order(struct xarray *, unsigned long index);
int xas_get_order(struct xa_state *xas);
void xas_split(struct xa_state *, void *entry, unsigned int order);
void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
void xas_try_split(struct xa_state *xas, void *entry, unsigned int order);
unsigned int xas_try_split_min_order(unsigned int order);
#else
static inline int xa_get_order(struct xarray *xa, unsigned long index)
{
return 0;
}
static inline int xas_get_order(struct xa_state *xas)
{
return 0;
}
static inline void xas_split(struct xa_state *xas, void *entry,
unsigned int order)
{
xas_store(xas, entry);
}
static inline void xas_split_alloc(struct xa_state *xas, void *entry,
unsigned int order, gfp_t gfp)
{
}
static inline void xas_try_split(struct xa_state *xas, void *entry,
unsigned int order)
{
}
static inline unsigned int xas_try_split_min_order(unsigned int order)
{
return 0;
}
#endif
/**
* xas_reload() - Refetch an entry from the xarray.
* @xas: XArray operation state.
*
* Use this function to check that a previously loaded entry still has
* the same value. This is useful for the lockless pagecache lookup where
* we walk the array with only the RCU lock to protect us, lock the page,
* then check that the page hasn't moved since we looked it up.
*
* The caller guarantees that @xas is still valid. If it may be in an
* error or restart state, call xas_load() instead.
*
* Return: The entry at this location in the xarray.
*/
static inline void *xas_reload(struct xa_state *xas)
{
struct xa_node *node = xas->xa_node;
void *entry;
char offset;
if (!node)
return xa_head(xas->xa);
if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
entry = xa_entry(xas->xa, node, offset);
if (!xa_is_sibling(entry))
return entry;
offset = xa_to_sibling(entry);
} else {
offset = xas->xa_offset;
}
return xa_entry(xas->xa, node, offset);
}
/**
* xas_set() - Set up XArray operation state for a different index.
* @xas: XArray operation state.
* @index: New index into the XArray.
*
* Move the operation state to refer to a different index. This will
* have the effect of starting a walk from the top; see xas_next()
* to move to an adjacent index.
*/
static inline void xas_set(struct xa_state *xas, unsigned long index)
{
xas->xa_index = index;
xas->xa_node = XAS_RESTART;
}
/**
* xas_advance() - Skip over sibling entries.
* @xas: XArray operation state.
* @index: Index of last sibling entry.
*
* Move the operation state to refer to the last sibling entry.
* This is useful for loops that normally want to see sibling
* entries but sometimes want to skip them. Use xas_set() if you
* want to move to an index which is not part of this entry.
*/
static inline void xas_advance(struct xa_state *xas, unsigned long index)
{
unsigned char shift = xas_is_node(xas) ? xas->xa_node->shift : 0;
xas->xa_index = index;
xas->xa_offset = (index >> shift) & XA_CHUNK_MASK;
}
/**
* xas_set_order() - Set up XArray operation state for a multislot entry.
* @xas: XArray operation state.
* @index: Target of the operation.
* @order: Entry occupies 2^@order indices.
*/
static inline void xas_set_order(struct xa_state *xas, unsigned long index,
unsigned int order)
{
#ifdef CONFIG_XARRAY_MULTI
xas->xa_index = order < BITS_PER_LONG ? (index >> order) << order : 0;
xas->xa_shift = order - (order % XA_CHUNK_SHIFT);
xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
xas->xa_node = XAS_RESTART;
#else
BUG_ON(order > 0);
xas_set(xas, index);
#endif
}
/**
* xas_set_update() - Set up XArray operation state for a callback.
* @xas: XArray operation state.
* @update: Function to call when updating a node.
*
* The XArray can notify a caller after it has updated an xa_node.
* This is advanced functionality and is only needed by the page
* cache and swap cache.
*/
static inline void xas_set_update(struct xa_state *xas, xa_update_node_t update)
{
xas->xa_update = update;
}
static inline void xas_set_lru(struct xa_state *xas, struct list_lru *lru)
{
xas->xa_lru = lru;
}
/**
* xas_next_entry() - Advance iterator to next present entry.
* @xas: XArray operation state.
* @max: Highest index to return.
*
* xas_next_entry() is an inline function to optimise xarray traversal for
* speed. It is equivalent to calling xas_find(), and will call xas_find()
* for all the hard cases.
*
* Return: The next present entry after the one currently referred to by @xas.
*/
static inline void *xas_next_entry(struct xa_state *xas, unsigned long max)
{
struct xa_node *node = xas->xa_node;
void *entry;
if (unlikely(xas_not_node(node) || node->shift ||
xas->xa_offset != (xas->xa_index & XA_CHUNK_MASK)))
return xas_find(xas, max);
do {
if (unlikely(xas->xa_index >= max))
return xas_find(xas, max);
if (unlikely(xas->xa_offset == XA_CHUNK_MASK))
return xas_find(xas, max);
entry = xa_entry(xas->xa, node, xas->xa_offset + 1);
if (unlikely(xa_is_internal(entry)))
return xas_find(xas, max);
xas->xa_offset++;
xas->xa_index++;
} while (!entry);
return entry;
}
/* Private */
static inline unsigned int xas_find_chunk(struct xa_state *xas, bool advance,
xa_mark_t mark)
{
unsigned long *addr = xas->xa_node->marks[(__force unsigned)mark];
unsigned int offset = xas->xa_offset;
if (advance)
offset++;
if (XA_CHUNK_SIZE == BITS_PER_LONG) {
if (offset < XA_CHUNK_SIZE) {
unsigned long data = *addr & (~0UL << offset);
if (data)
return __ffs(data);
}
return XA_CHUNK_SIZE;
}
return find_next_bit(addr, XA_CHUNK_SIZE, offset);
}
/**
* xas_next_marked() - Advance iterator to next marked entry.
* @xas: XArray operation state.
* @max: Highest index to return.
* @mark: Mark to search for.
*
* xas_next_marked() is an inline function to optimise xarray traversal for
* speed. It is equivalent to calling xas_find_marked(), and will call
* xas_find_marked() for all the hard cases.
*
* Return: The next marked entry after the one currently referred to by @xas.
*/
static inline void *xas_next_marked(struct xa_state *xas, unsigned long max,
xa_mark_t mark)
{
struct xa_node *node = xas->xa_node;
void *entry;
unsigned int offset;
if (unlikely(xas_not_node(node) || node->shift))
return xas_find_marked(xas, max, mark);
offset = xas_find_chunk(xas, true, mark);
xas->xa_offset = offset;
xas->xa_index = (xas->xa_index & ~XA_CHUNK_MASK) + offset;
if (xas->xa_index > max)
return NULL;
if (offset == XA_CHUNK_SIZE)
return xas_find_marked(xas, max, mark);
entry = xa_entry(xas->xa, node, offset);
if (!entry)
return xas_find_marked(xas, max, mark);
return entry;
}
/*
* If iterating while holding a lock, drop the lock and reschedule
* every %XA_CHECK_SCHED loops.
*/
enum {
XA_CHECK_SCHED = 4096,
};
/**
* xas_for_each() - Iterate over a range of an XArray.
* @xas: XArray operation state.
* @entry: Entry retrieved from the array.
* @max: Maximum index to retrieve from array.
*
* The loop body will be executed for each entry present in the xarray
* between the current xas position and @max. @entry will be set to
* the entry retrieved from the xarray. It is safe to delete entries
* from the array in the loop body. You should hold either the RCU lock
* or the xa_lock while iterating. If you need to drop the lock, call
* xas_pause() first.
*/
#define xas_for_each(xas, entry, max) \
for (entry = xas_find(xas, max); entry; \
entry = xas_next_entry(xas, max))
/**
* xas_for_each_marked() - Iterate over a range of an XArray.
* @xas: XArray operation state.
* @entry: Entry retrieved from the array.
* @max: Maximum index to retrieve from array.
* @mark: Mark to search for.
*
* The loop body will be executed for each marked entry in the xarray
* between the current xas position and @max. @entry will be set to
* the entry retrieved from the xarray. It is safe to delete entries
* from the array in the loop body. You should hold either the RCU lock
* or the xa_lock while iterating. If you need to drop the lock, call
* xas_pause() first.
*/
#define xas_for_each_marked(xas, entry, max, mark) \
for (entry = xas_find_marked(xas, max, mark); entry; \
entry = xas_next_marked(xas, max, mark))
/**
* xas_for_each_conflict() - Iterate over a range of an XArray.
* @xas: XArray operation state.
* @entry: Entry retrieved from the array.
*
* The loop body will be executed for each entry in the XArray that
* lies within the range specified by @xas. If the loop terminates
* normally, @entry will be %NULL. The user may break out of the loop,
* which will leave @entry set to the conflicting entry. The caller
* may also call xa_set_err() to exit the loop while setting an error
* to record the reason.
*/
#define xas_for_each_conflict(xas, entry) \
while ((entry = xas_find_conflict(xas)))
void *__xas_next(struct xa_state *);
void *__xas_prev(struct xa_state *);
/**
* xas_prev() - Move iterator to previous index.
* @xas: XArray operation state.
*
* If the @xas was in an error state, it will remain in an error state
* and this function will return %NULL. If the @xas has never been walked,
* it will have the effect of calling xas_load(). Otherwise one will be
* subtracted from the index and the state will be walked to the correct
* location in the array for the next operation.
*
* If the iterator was referencing index 0, this function wraps
* around to %ULONG_MAX.
*
* Return: The entry at the new index. This may be %NULL or an internal
* entry.
*/
static inline void *xas_prev(struct xa_state *xas)
{
struct xa_node *node = xas->xa_node;
if (unlikely(xas_not_node(node) || node->shift ||
xas->xa_offset == 0))
return __xas_prev(xas);
xas->xa_index--;
xas->xa_offset--;
return xa_entry(xas->xa, node, xas->xa_offset);
}
/**
* xas_next() - Move state to next index.
* @xas: XArray operation state.
*
* If the @xas was in an error state, it will remain in an error state
* and this function will return %NULL. If the @xas has never been walked,
* it will have the effect of calling xas_load(). Otherwise one will be
* added to the index and the state will be walked to the correct
* location in the array for the next operation.
*
* If the iterator was referencing index %ULONG_MAX, this function wraps
* around to 0.
*
* Return: The entry at the new index. This may be %NULL or an internal
* entry.
*/
static inline void *xas_next(struct xa_state *xas)
{
struct xa_node *node = xas->xa_node;
if (unlikely(xas_not_node(node) || node->shift ||
xas->xa_offset == XA_CHUNK_MASK))
return __xas_next(xas);
xas->xa_index++;
xas->xa_offset++;
return xa_entry(xas->xa, node, xas->xa_offset);
}
#endif /* _LINUX_XARRAY_H */
] |
| 13/2 | 2026/02/24 23:24 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 38286 output: 40 thoughts: 2294 Thoughts: **XArray Mark Clearing Race: A Deep Dive**
Okay, so I'm looking at this KCSAN report related to XArray and the `node_clear_mark` function, specifically the race condition it flags. The core of the issue seems to be the non-atomic nature of `__test_and_clear_bit` when accessing `node->marks[mark][0]`. This raises a few immediate questions: is it a real problem, and if so, how bad is it?
First, the simple analysis: if the reader (some other thread) sees the old value, no harm; if it sees the new value, still no harm. The worry is seeing a "torn" value. But, and this is crucial, on x86, 64-bit aligned loads and stores are atomic. Since `xa_node` is aligned and `marks` is an array of `unsigned long`, the address `addr` (pointing to `node->marks[mark][0]`) is also aligned, which means the load should be atomic. So, on x86, it *should* be benign.
But, and here's where I need to be careful, Linux kernel code often assumes that aligned `unsigned long` accesses are atomic across architectures. What if an architecture *doesn't* guarantee this? Then we have a real problem. I also need to check whether, even with atomic loads and stores, there is any other reason why a "torn" value could be seen. I also checked whether the compiler generating multiple instructions for the load would cause an issue. While that could happen for a volatile pointer, it is not the case here.
Actually, I recall some recent discussion about XArray using `READ_ONCE` or `atomic_long_t` to avoid these KCSAN reports, so I need to check if this patch is already in the kernel. This KCSAN report is from a recent kernel, so that's something to investigate. The presence of `instrument_read_write` in the stack trace for `xas_clear_mark` and the similar instrumentation of `xas_find_chunk` confirms the KCSAN reporting.
Now, to the important question: can this race lead to an actual crash or data corruption? If `xas_find_marked` returns an entry that's then used, that's where the danger lies. However, I know from the structure of XArray that `filemap_get_folios_tag` re-checks the entry with `xas_reload`, and that, in turn, uses `rcu_dereference_check`, which also uses `READ_ONCE`. So, even if there's a transiently "wrong" value due to the race, the subsequent check should catch it. Let's also think through `xas_find_marked` going into a loop due to this race, but that looks fine because `xas_find_chunk` uses a relatively small offset, making this unlikely.
Therefore, given the architecture-specific atomicity and the subsequent safety checks, the race *is* likely benign. This is a classic example of a lockless data structure race handled by the structure itself.
I need to confirm, though, whether this "torn value" really represents a single bit being cleared or multiple bits being touched. I need to make sure I fully understand it. The KCSAN report shows that a group of 4 bits changed at once: `0x003fffff00000000` -> `0x003ffff000000000`. So, bits 32-35 of a long are cleared. This could happen with an order-2 multi-index entry, which spans four slots. XArray only uses one bit for multi-order entries. So, is `xas_clear_mark` only clearing one bit? Or are all the bits that represent the entry being cleared? I'll have to check the implementation of `xas_clear_mark`.
Okay, I think I'm starting to get it. When there is a multi-order entry (order 2), XArray represents this in four slots. `xas->xa_offset` points to the *first* slot. So, `node_clear_mark` clears the bit associated with that particular slot. However, XArray only sets the mark on the *first* slot of a multi-order entry. So, only one bit will be set, and only one bit should be cleared. So why are 4 bits changed? This must be the result of a single write instruction that clears the 4 bits? It's not. I'm going to revisit `xas_clear_mark` and `xas_set_mark`. |
| 14/2 | 2026/02/24 23:25 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:lib/xarray.c IncludeLines:true Name:xas_set_mark] Results: map[SourceCode: 875: /**
876: * xas_set_mark() - Sets the mark on this entry and its parents.
877: * @xas: XArray operation state.
878: * @mark: Mark number.
879: *
880: * Sets the specified mark on this entry, and walks up the tree setting it
881: * on all the ancestor entries. Does nothing if @xas has not been walked to
882: * an entry, or is in an error state.
883: */
884: void xas_set_mark(const struct xa_state *xas, xa_mark_t mark)
885: {
886: struct xa_node *node = xas->xa_node;
887: unsigned int offset = xas->xa_offset;
888:
889: if (xas_invalid(xas))
890: return;
891:
892: while (node) {
893: if (node_set_mark(node, offset, mark))
894: return;
895: offset = node->offset;
896: node = xa_parent_locked(xas->xa, node);
897: }
898:
899: if (!xa_marked(xas->xa, mark))
900: xa_mark_set(xas->xa, mark);
901: }
SourceFile:lib/xarray.c] |
| 15/2 | 2026/02/24 23:25 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 42066 output: 44 thoughts: 837 Thoughts: **Analyzing a KCSAN Report: A Deep Dive into XArray Concurrency**
Okay, so I'm looking at this KCSAN report, and the core of the issue seems to be a race condition involving bit manipulation within an XArray. It initially struck me that the report only mentioned a single bit being set, but then it notes multiple bits being *cleared* – that didn't immediately compute. A multi-order entry only sets one bit, the first one. So, why are four bits being cleared?
My initial thought was maybe it wasn't a multi-order entry, but four separate entries were cleared instead. But the KCSAN report seems to indicate a single write access. Okay, so I remembered that `__test_and_clear_bit` on x86 utilizes the `btr` instruction. That only clears *one* bit. So, the report might be aggregating changes within the observed memory range. KCSAN monitors a memory range, and if multiple writes happen within the observation period, it might report a cumulative change.
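To pin down exactly which bits differ between the two values quoted from the KCSAN report earlier in this trace (`0x003fffff00000000` -> `0x003ffff000000000`), a small userspace sketch can XOR the two snapshots. This is not kernel code: the helper names `changed_bits`, `changed_count` and `lowest_changed_bit` are hypothetical, and the `__builtin_*` calls assume GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Bits that differ between two snapshots of the same mark word. */
static uint64_t changed_bits(uint64_t before, uint64_t after)
{
    return before ^ after;
}

/* Number of differing bits (GCC/Clang builtin). */
static int changed_count(uint64_t before, uint64_t after)
{
    return __builtin_popcountll(changed_bits(before, after));
}

/* Index of the lowest differing bit; the two values must differ. */
static int lowest_changed_bit(uint64_t before, uint64_t after)
{
    assert(before != after);
    return __builtin_ctzll(changed_bits(before, after));
}
```

Applied to the reported pair, the XOR is `0x0000000f00000000`: four bits, starting at bit 32. That is consistent with either four separate single-bit clears landing inside KCSAN's observation window or one wider clear, so the arithmetic alone does not distinguish the two.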
Regardless of the number of bits, the race is between a writer with a lock and a reader using RCU. But could the reader *not* be using RCU? No, `filemap_get_folios_tag` is clearly using `rcu_read_lock()`.
I started to wonder if `ext4` was doing something unusual with XArray. But the stack trace, `ext4_do_writepages` -> `mpage_prepare_extent_to_map` -> `filemap_get_folios_tag`, seems pretty standard pagecache traversal.
Now, I'm thinking I should look at `xas_find_marked` implementation. Specifically, the `xas_find_chunk` function:
```c
if (XA_CHUNK_SIZE == BITS_PER_LONG) {
if (offset < XA_CHUNK_SIZE) {
unsigned long data = *addr & (~0UL << offset);
if (data)
return __ffs(data);
}
return XA_CHUNK_SIZE;
}
```
If `*addr` is being modified by `btr` on another CPU, the reader might see either the old or the new value, both of which are valid states of the bitmap, and it will return a valid bit index. So, this race is most likely benign.
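The old-or-new argument above can be illustrated with a small userspace sketch. It is not the kernel code: `find_chunk_snapshot` and `first_set_bit` are hypothetical names, `CHUNK_SIZE` stands in for `XA_CHUNK_SIZE == BITS_PER_LONG` on a 64-bit build, and `__builtin_ctzll` (a GCC/Clang builtin) stands in for `__ffs()`. The point is only that the scan operates on one snapshot of the word, so whichever snapshot the reader gets, the result is an index that was set in that snapshot, or `CHUNK_SIZE`.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE 64u /* assumed stand-in for XA_CHUNK_SIZE == BITS_PER_LONG */

/* Index of the lowest set bit; undefined for 0, so callers must check first. */
static unsigned int first_set_bit(uint64_t word)
{
	assert(word != 0);
	return (unsigned int)__builtin_ctzll(word);
}

/*
 * Scan one 64-bit mark word, taken as a single snapshot, starting at 'offset'.
 * Returns the index of the first set bit at or above 'offset', or CHUNK_SIZE
 * if there is none. Mirrors the shape of the snippet quoted above.
 */
static unsigned int find_chunk_snapshot(uint64_t snapshot, unsigned int offset)
{
	if (offset < CHUNK_SIZE) {
		uint64_t data = snapshot & (~UINT64_C(0) << offset);

		if (data)
			return first_set_bit(data);
	}
	return CHUNK_SIZE;
}
```

For a word with bits 32 to 53 set, a scan from offset 0 yields 32; if the four lowest of those bits have already been cleared, the same scan yields 36. Either answer corresponds to a state the bitmap really was in.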
Wait, I should check if there are any other similar reports in the kernel. Ah, there was a similar report for `radix_tree` which is the predecessor of XArray. The tags were also bitmaps, and they had similar races which were generally considered benign.
Let's check if there is any reason to consider it harmful. Could it lead to a NULL pointer dereference? If `xas_find_marked` returns an offset that points to a NULL slot.
```c
entry = xa_entry(xas->xa, xas->xa_node, xas->xa_offset);
if (!entry && !(xa_track_free(xas->xa) && mark == XA_FREE_MARK))
continue;
```
That's handled; the code continues if it encounters a NULL entry.
Could it lead to an infinite loop? I've checked that. Could it lead to a use-after-free? Entries are protected by RCU, and `filemap_get_folios_tag` is using `folio_try_get_rcu`.
So, after this analysis, I'm pretty confident that this is a benign race condition. Wait, but, I should check if `xas_clear_mark` is always called with the lock. The stack trace shows: `xas_lock_irqsave(&xas, flags);` Wait, let me check `__folio_start_writeback` source.
|
| 16/2 | 2026/02/24 23:25 | tool | codesearch-definition-source |
Args:map[ContextFile:mm/page-writeback.c IncludeLines:true Name:__folio_start_writeback] Results: map[SourceCode:2972: void __folio_start_writeback(struct folio *folio, bool keep_write)
2973: {
2974: long nr = folio_nr_pages(folio);
2975: struct address_space *mapping = folio_mapping(folio);
2976: int access_ret;
2977:
2978: VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
2979: VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
2980:
2981: if (mapping && mapping_use_writeback_tags(mapping)) {
2982: XA_STATE(xas, &mapping->i_pages, folio->index);
2983: struct inode *inode = mapping->host;
2984: struct bdi_writeback *wb;
2985: unsigned long flags;
2986: bool on_wblist;
2987:
2988: xas_lock_irqsave(&xas, flags);
2989: xas_load(&xas);
2990: folio_test_set_writeback(folio);
2991:
2992: on_wblist = mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
2993:
2994: xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK);
2995: wb = inode_to_wb(inode);
2996: wb_stat_mod(wb, WB_WRITEBACK, nr);
2997: if (!on_wblist) {
2998: wb_inode_writeback_start(wb);
2999: /*
3000: * We can come through here when swapping anonymous
3001: * folios, so we don't necessarily have an inode to
3002: * track for sync.
3003: */
3004: if (mapping->host)
3005: sb_mark_inode_writeback(mapping->host);
3006: }
3007:
3008: if (!folio_test_dirty(folio))
3009: xas_clear_mark(&xas, PAGECACHE_TAG_DIRTY);
3010: if (!keep_write)
3011: xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
3012: xas_unlock_irqrestore(&xas, flags);
3013: } else {
3014: folio_test_set_writeback(folio);
3015: }
3016:
3017: lruvec_stat_mod_folio(folio, NR_WRITEBACK, nr);
3018: zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
3019:
3020: access_ret = arch_make_folio_accessible(folio);
3021: /*
3022: * If writeback has been triggered on a page that cannot be made
3023: * accessible, it is too late to recover here.
3024: */
3025: VM_BUG_ON_FOLIO(access_ret != 0, folio);
3026: }
SourceFile:mm/page-writeback.c] |
| 17/2 | 2026/02/24 23:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 44677 output: 47 thoughts: 114 Thoughts: **Analyzing the XArray Locking in Ext4**
Okay, so I'm looking at this code and, yes, I see that `xas_lock_irqsave(&xas, flags);` is definitely called on line 2988. Good, that confirms my initial assumption about the lock being acquired there. And I also note that `xas_clear_mark` is called on lines 3009 and 3011, before `xas_unlock_irqrestore` on line 3012. So the writer holds the lock across both calls. It all seems pretty standard so far, nothing jumps out as an immediate red flag.
Now, the interesting part is the reader. We have `filemap_get_folios_tag` being called from `mpage_prepare_extent_to_map`. That's where I need to dig deeper. I need to understand the read/write concurrency. I'll need to check the function `mpage_prepare_extent_to_map`. Time to switch over to `fs/ext4/inode.c` and see what's happening there with the `filemap_get_folios_tag` calls. This should give me a better understanding of how the XArray is used in this read path. |
| 18/2 | 2026/02/24 23:25 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/ext4/inode.c IncludeLines:true Name:mpage_prepare_extent_to_map] Results: map[SourceCode:2586: /*
2587: * mpage_prepare_extent_to_map - find & lock contiguous range of dirty pages
2588: * needing mapping, submit mapped pages
2589: *
2590: * @mpd - where to look for pages
2591: *
2592: * Walk dirty pages in the mapping. If they are fully mapped, submit them for
2593: * IO immediately. If we cannot map blocks, we submit just already mapped
2594: * buffers in the page for IO and keep page dirty. When we can map blocks and
2595: * we find a page which isn't mapped we start accumulating extent of buffers
2596: * underlying these pages that needs mapping (formed by either delayed or
2597: * unwritten buffers). We also lock the pages containing these buffers. The
2598: * extent found is returned in @mpd structure (starting at mpd->lblk with
2599: * length mpd->len blocks).
2600: *
2601: * Note that this function can attach bios to one io_end structure which are
2602: * neither logically nor physically contiguous. Although it may seem as an
2603: * unnecessary complication, it is actually inevitable in blocksize < pagesize
2604: * case as we need to track IO to all buffers underlying a page in one io_end.
2605: */
2606: static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
2607: {
2608: struct address_space *mapping = mpd->inode->i_mapping;
2609: struct folio_batch fbatch;
2610: unsigned int nr_folios;
2611: pgoff_t index = mpd->start_pos >> PAGE_SHIFT;
2612: pgoff_t end = mpd->end_pos >> PAGE_SHIFT;
2613: xa_mark_t tag;
2614: int i, err = 0;
2615: ext4_lblk_t lblk;
2616: struct buffer_head *head;
2617: handle_t *handle = NULL;
2618: int bpp = ext4_journal_blocks_per_folio(mpd->inode);
2619:
2620: tag = wbc_to_tag(mpd->wbc);
2621:
2622: mpd->map.m_len = 0;
2623: mpd->next_pos = mpd->start_pos;
2624: if (ext4_should_journal_data(mpd->inode)) {
2625: handle = ext4_journal_start(mpd->inode, EXT4_HT_WRITE_PAGE,
2626: bpp);
2627: if (IS_ERR(handle))
2628: return PTR_ERR(handle);
2629: }
2630: folio_batch_init(&fbatch);
2631: while (index <= end) {
2632: nr_folios = filemap_get_folios_tag(mapping, &index, end,
2633: tag, &fbatch);
2634: if (nr_folios == 0)
2635: break;
2636:
2637: for (i = 0; i < nr_folios; i++) {
2638: struct folio *folio = fbatch.folios[i];
2639:
2640: /*
2641: * Accumulated enough dirty pages? This doesn't apply
2642: * to WB_SYNC_ALL mode. For integrity sync we have to
2643: * keep going because someone may be concurrently
2644: * dirtying pages, and we might have synced a lot of
2645: * newly appeared dirty pages, but have not synced all
2646: * of the old dirty pages.
2647: */
2648: if (mpd->wbc->sync_mode == WB_SYNC_NONE &&
2649: mpd->wbc->nr_to_write <=
2650: EXT4_LBLK_TO_PG(mpd->inode, mpd->map.m_len))
2651: goto out;
2652:
2653: /* If we can't merge this page, we are done. */
2654: if (mpd->map.m_len > 0 &&
2655: mpd->next_pos != folio_pos(folio))
2656: goto out;
2657:
2658: if (handle) {
2659: err = ext4_journal_ensure_credits(handle, bpp,
2660: 0);
2661: if (err < 0)
2662: goto out;
2663: }
2664:
2665: folio_lock(folio);
2666: /*
2667: * If the page is no longer dirty, or its mapping no
2668: * longer corresponds to inode we are writing (which
2669: * means it has been truncated or invalidated), or the
2670: * page is already under writeback and we are not doing
2671: * a data integrity writeback, skip the page
2672: */
2673: if (!folio_test_dirty(folio) ||
2674: (folio_test_writeback(folio) &&
2675: (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
2676: unlikely(folio->mapping != mapping)) {
2677: folio_unlock(folio);
2678: continue;
2679: }
2680:
2681: folio_wait_writeback(folio);
2682: BUG_ON(folio_test_writeback(folio));
2683:
2684: /*
2685: * Should never happen but for buggy code in
2686: * other subsystems that call
2687: * set_page_dirty() without properly warning
2688: * the file system first. See [1] for more
2689: * information.
2690: *
2691: * [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
2692: */
2693: if (!folio_buffers(folio)) {
2694: ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", folio->index);
2695: folio_clear_dirty(folio);
2696: folio_unlock(folio);
2697: continue;
2698: }
2699:
2700: if (mpd->map.m_len == 0)
2701: mpd->start_pos = folio_pos(folio);
2702: mpd->next_pos = folio_next_pos(folio);
2703: /*
2704: * Writeout when we cannot modify metadata is simple.
2705: * Just submit the page. For data=journal mode we
2706: * first handle writeout of the page for checkpoint and
2707: * only after that handle delayed page dirtying. This
2708: * makes sure current data is checkpointed to the final
2709: * location before possibly journalling it again which
2710: * is desirable when the page is frequently dirtied
2711: * through a pin.
2712: */
2713: if (!mpd->can_map) {
2714: err = mpage_submit_folio(mpd, folio);
2715: if (err < 0)
2716: goto out;
2717: /* Pending dirtying of journalled data? */
2718: if (folio_test_checked(folio)) {
2719: err = mpage_journal_page_buffers(handle,
2720: mpd, folio);
2721: if (err < 0)
2722: goto out;
2723: mpd->journalled_more_data = 1;
2724: }
2725: mpage_folio_done(mpd, folio);
2726: } else {
2727: /* Add all dirty buffers to mpd */
2728: lblk = EXT4_PG_TO_LBLK(mpd->inode, folio->index);
2729: head = folio_buffers(folio);
2730: err = mpage_process_page_bufs(mpd, head, head,
2731: lblk);
2732: if (err <= 0)
2733: goto out;
2734: err = 0;
2735: }
2736: }
2737: folio_batch_release(&fbatch);
2738: cond_resched();
2739: }
2740: mpd->scanned_until_end = 1;
2741: if (handle)
2742: ext4_journal_stop(handle);
2743: return 0;
2744: out:
2745: folio_batch_release(&fbatch);
2746: if (handle)
2747: ext4_journal_stop(handle);
2748: return err;
2749: }
SourceFile:fs/ext4/inode.c] |
| 19/2 | 2026/02/24 23:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 47715 output: 47 thoughts: 1397 Thoughts: **Analyzing the Potential Data Race in XArray Mark Clearing**
Okay, let's break this down. At line 2665, we have `folio_lock(folio)`. Good. This sets the stage for checking dirty and writeback status. The core of the problem seems to be a potential race between the `filemap_get_folios_tag` function, which reads and acts upon XArray marks, and the `__folio_start_writeback` function, which clears those marks.
My initial thought was, if `filemap_get_folios_tag` sees a mark, but `__folio_start_writeback` is concurrently clearing it, we have a problem. The key here is the order of operations in `__folio_start_writeback`: it *sets* the writeback bit before clearing the mark. Let's verify. Yes, the code locks, sets `folio_test_set_writeback`, and *then*, if the folio isn't dirty, clears the mark using `xas_clear_mark`. So if `mpage_prepare_extent_to_map` gets a folio that's in writeback, it'll correctly detect that and either wait or skip the folio, depending on the sync mode. This part looks safe.
The potential race is on the XArray mark bitmap itself. The *writer* holds a spinlock, and the *reader* uses RCU. The writer uses non-atomic bit operations, while the reader uses plain loads. On x86, this is fine because of the inherent memory model guarantees; we shouldn't see tearing. The logic handles both seeing the old and the new values correctly. Therefore, so far, it *should* be benign.
But wait, I need to be *sure*. What about compiler reordering? In `xas_find_chunk`, it's just a single load. In `node_clear_mark`, it's just a single `btr` instruction. Does `btr` have a memory barrier? No, not without a `lock` prefix. But it's a single instruction, so even without a memory barrier, it *should* be fine.
Okay, let's dig deeper. Is there a scenario where `xas_find_marked` used in a loop could lead to an inconsistent state of the XArray tree? The XArray uses RCU for its tree structure, and marks are part of the nodes themselves. When a mark is cleared, it's an in-place modification of the node. If the node itself is replaced, RCU is used. Here, we're just talking about bit changes. If the reader sees a mark in a parent but not the child, it handles that case. If the reader sees a mark in the child, but not the parent, it won't even *get* to the child node (unless it was already there). What about the case where it's at the child, and we're iterating to the next marked entry? It calls `xas_find_chunk`. If `xas_find_chunk` *finds* a mark, it returns the offset. If not, it bubbles up to the parent. The logic *seems* robust and correct in this scenario.
One more thing to consider: `__test_and_clear_bit` is non-atomic. That makes it a read-modify-write operation. If *two* writers were competing, we'd have a problem. But there's only *one* writer (holding the spinlock). Readers *only* read. So, the only remaining risk is a reader seeing a torn value. And, as we've established, that's not a risk on x86 for aligned 64-bit accesses.
Okay, given all that, I think this is indeed benign. Let's make sure I'm doing everything correctly... ah yes, I need to use `set-results`.
Okay, one final double check. I was looking at these bit shifts, specifically `0x003fffff00000000` -> `0x003ffff000000000`. I want to make *absolutely* sure I understand that. Let's look at the bits. So, `0x3fffff` has 22 bits set, `0x3ffff0` is those 22 bits with the last 4 bits zeroed. So, the lower 4 bits of the 22-bit value have been cleared. `0x003fffff00000000` is the `0x3fffff` value shifted left by 32 bits, which means bits 32-53 are set. And `0 |
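The bit arithmetic on the two quoted word values can be checked mechanically. The following is a minimal userspace sketch, not kernel code; the helper names are hypothetical and the `__builtin_*` functions are GCC/Clang builtins. XOR-ing the before and after words isolates exactly the bits that changed.

```c
#include <assert.h>
#include <stdint.h>

/* Bits that differ between two snapshots of the same word. */
static uint64_t changed_bits(uint64_t before, uint64_t after)
{
	return before ^ after;
}

/* Number of set bits in the word. */
static unsigned int count_bits(uint64_t word)
{
	return (unsigned int)__builtin_popcountll(word);
}

/* Index of the lowest set bit; the word must be nonzero. */
static unsigned int lowest_bit(uint64_t word)
{
	assert(word != 0);
	return (unsigned int)__builtin_ctzll(word);
}

/* Index of the highest set bit; the word must be nonzero. */
static unsigned int highest_bit(uint64_t word)
{
	assert(word != 0);
	return 63u - (unsigned int)__builtin_clzll(word);
}
```

Applied to `0x003fffff00000000` and `0x003ffff000000000`, the first word has 22 bits set (bits 32 to 53), the second has 18 (bits 36 to 53), and the difference is `0x0000000f00000000`, that is, the four bits 32 to 35.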
| 20/2 | 2026/02/24 23:25 | tool | codesearch-definition-source |
Args:map[ContextFile:fs/ext4/page-io.c IncludeLines:true Name:ext4_bio_write_folio] Results: map[SourceCode: 458: int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
459: size_t len)
460: {
461: struct folio *io_folio = folio;
462: struct inode *inode = folio->mapping->host;
463: unsigned block_start;
464: struct buffer_head *bh, *head;
465: int ret = 0;
466: int nr_to_submit = 0;
467: struct writeback_control *wbc = io->io_wbc;
468: bool keep_towrite = false;
469:
470: BUG_ON(!folio_test_locked(folio));
471: BUG_ON(folio_test_writeback(folio));
472:
473: /*
474: * Comments copied from block_write_full_folio:
475: *
476: * The folio straddles i_size. It must be zeroed out on each and every
477: * writepage invocation because it may be mmapped. "A file is mapped
478: * in multiples of the page size. For a file that is not a multiple of
479: * the page size, the remaining memory is zeroed when mapped, and
480: * writes to that region are not written out to the file."
481: */
482: if (len < folio_size(folio))
483: folio_zero_segment(folio, len, folio_size(folio));
484: /*
485: * In the first loop we prepare and mark buffers to submit. We have to
486: * mark all buffers in the folio before submitting so that
487: * folio_end_writeback() cannot be called from ext4_end_bio() when IO
488: * on the first buffer finishes and we are still working on submitting
489: * the second buffer.
490: */
491: bh = head = folio_buffers(folio);
492: do {
493: block_start = bh_offset(bh);
494: if (block_start >= len) {
495: clear_buffer_dirty(bh);
496: set_buffer_uptodate(bh);
497: continue;
498: }
499: if (!buffer_dirty(bh) || buffer_delay(bh) ||
500: !buffer_mapped(bh) || buffer_unwritten(bh)) {
501: /* A hole? We can safely clear the dirty bit */
502: if (!buffer_mapped(bh))
503: clear_buffer_dirty(bh);
504: /*
505: * Keeping dirty some buffer we cannot write? Make sure
506: * to redirty the folio and keep TOWRITE tag so that
507: * racing WB_SYNC_ALL writeback does not skip the folio.
508: * This happens e.g. when doing writeout for
509: * transaction commit or when journalled data is not
510: * yet committed.
511: */
512: if (buffer_dirty(bh) ||
513: (buffer_jbd(bh) && buffer_jbddirty(bh))) {
514: if (!folio_test_dirty(folio))
515: folio_redirty_for_writepage(wbc, folio);
516: keep_towrite = true;
517: }
518: continue;
519: }
520: if (buffer_new(bh))
521: clear_buffer_new(bh);
522: set_buffer_async_write(bh);
523: clear_buffer_dirty(bh);
524: nr_to_submit++;
525: } while ((bh = bh->b_this_page) != head);
526:
527: /* Nothing to submit? Just unlock the folio... */
528: if (!nr_to_submit)
529: return 0;
530:
531: bh = head = folio_buffers(folio);
532:
533: /*
534: * If any blocks are being written to an encrypted file, encrypt them
535: * into a bounce page. For simplicity, just encrypt until the last
536: * block which might be needed. This may cause some unneeded blocks
537: * (e.g. holes) to be unnecessarily encrypted, but this is rare and
538: * can't happen in the common case of blocksize == PAGE_SIZE.
539: */
540: if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
541: gfp_t gfp_flags = GFP_NOFS;
542: unsigned int enc_bytes = round_up(len, i_blocksize(inode));
543: struct page *bounce_page;
544:
545: /*
546: * Since bounce page allocation uses a mempool, we can only use
547: * a waiting mask (i.e. request guaranteed allocation) on the
548: * first page of the bio. Otherwise it can deadlock.
549: */
550: if (io->io_bio)
551: gfp_flags = GFP_NOWAIT;
552: retry_encrypt:
553: bounce_page = fscrypt_encrypt_pagecache_blocks(folio,
554: enc_bytes, 0, gfp_flags);
555: if (IS_ERR(bounce_page)) {
556: ret = PTR_ERR(bounce_page);
557: if (ret == -ENOMEM &&
558: (io->io_bio || wbc->sync_mode == WB_SYNC_ALL)) {
559: gfp_t new_gfp_flags = GFP_NOFS;
560: if (io->io_bio)
561: ext4_io_submit(io);
562: else
563: new_gfp_flags |= __GFP_NOFAIL;
564: memalloc_retry_wait(gfp_flags);
565: gfp_flags = new_gfp_flags;
566: goto retry_encrypt;
567: }
568:
569: printk_ratelimited(KERN_ERR "%s: ret = %d\n", __func__, ret);
570: folio_redirty_for_writepage(wbc, folio);
571: do {
572: if (buffer_async_write(bh)) {
573: clear_buffer_async_write(bh);
574: set_buffer_dirty(bh);
575: }
576: bh = bh->b_this_page;
577: } while (bh != head);
578:
579: return ret;
580: }
581: io_folio = page_folio(bounce_page);
582: }
583:
584: __folio_start_writeback(folio, keep_towrite);
585:
586: /* Now submit buffers to write */
587: do {
588: if (!buffer_async_write(bh))
589: continue;
590: io_submit_add_bh(io, inode, folio, io_folio, bh);
591: } while ((bh = bh->b_this_page) != head);
592:
593: return 0;
594: }
SourceFile:fs/ext4/page-io.c] |
| 21/2 | 2026/02/24 23:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 52381 output: 19 thoughts: 721 Thoughts: **Analyzing a Potential Data Race in ext4 and XArray**
Okay, let's break this down. `ext4_bio_write_folio` calls `__folio_start_writeback` once per folio. Right, so we're dealing with potentially multiple pages, maybe an order-2 folio which would have 4 pages. In XArray, that translates to 4 slots, but it's *one* entry spanning those 4 indices. The marks for these indices are stored within the node, which is key. The initial worry was whether XArray only set the mark on the first index, which should mean only one bit set, but that's not the case.
But if it's not a multi-order entry, and we have 4 separate pages (which is very likely given ext4's folio focus), then `ext4_do_writepages` would call `ext4_bio_write_folio` four times. Yet, KCSAN is reporting a *single* race. That's the crux. Aha! The "value changed" is the 64-bit word in the bitmap. If four bits are cleared, it suggests four bits were *previously* set. This means four separate folios were processed. KCSAN is observing those four bits of the word change from set to cleared while it is being read. The reader, using `xas_find_marked`, must be reading this multiple times, perhaps in a loop? No, that reads it only once per chunk.
So, the race is between a locked writer and an RCU reader on a bitmap. This bitmap's logic is robust and shouldn't exhibit tearing issues, even with a race.
Let's revisit the definition of "benign." A benign data race usually involves simple variables, unprotected accesses, and things like flags or statistics counters. However, in our case, there's a spinlock protecting the writers, but RCU-style readers are lockless. XArray marks *are* essentially flags. A non-benign race could lead to corruption with complex containers, like lists or hashmaps, where accesses should be mutually exclusive. XArray is complex, but the race is focused on a bitmap field within it. The container *is* designed to handle this specific read race, making it intentional and safe. So it's a "benign" race in the sense that it's designed and intentional, not a bug.
One final check: `xas_find_chunk` uses `__ffs(data)`. If `data` is zero, `__ffs(0)` is undefined. But, thankfully, `xas_find_chunk` has a check: `if (data)`. This makes it safe. I'm confident we've got a solid understanding of the issue. I'm ready. |
| 22/2 | 2026/02/24 23:25 | tool | set-results |
Args:map[Benign:true Confident:true] Results: map[Benign:true Confident:true] |
| 23/2 | 2026/02/24 23:25 | llm | expert |
Model:gemini-3-flash-preview Tokens: input: 53744 output: 721 thoughts: 168 Thoughts: **Benign Data Race in XArray - No Cause for Alarm**
Okay, the analysis is done, and it's confirmed: there's a data race. My initial instinct was right, but let's drill down. It's a classic: a writer, holding the XArray spinlock, is modifying the mark bitmap using non-atomic bit operations. The reader, on the other hand, runs under RCU without taking the spinlock and uses plain loads to check the marks.
But here's where it gets interesting, and why I'm ultimately not concerned. We're on x86_64, and *aligned* 64-bit loads and stores are atomic. So, we're not dealing with torn reads here, where the reader grabs a partially written value. The XArray design itself, and crucially, its callers, seem to be designed to handle this kind of transient inconsistency. Specifically, `mpage_prepare_extent_to_map`, the caller of `filemap_get_folios_tag` in this scenario, re-verifies the folio's state after taking the folio lock. That's the key. That re-check guarantees that a stale mark observed during the concurrent clear doesn't corrupt the underlying logic: even if the reader sees the old bitmap value for a fleeting moment, the check under the folio lock decides whether the folio is actually processed. It's a clever, resilient design.
Therefore, I can confidently conclude that the data race, while present, is completely benign. We can move on. |
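The re-check that the conclusion relies on is the condition at lines 2673 to 2676 of `mpage_prepare_extent_to_map` shown earlier in this log. Below is a minimal userspace sketch of that decision, assuming a simplified model: `struct toy_folio`, `enum sync_mode` and `must_skip` are hypothetical stand-ins, not kernel types, and the function is meant to be read as running with the folio lock already held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for WB_SYNC_NONE / WB_SYNC_ALL. */
enum sync_mode { SYNC_NONE, SYNC_ALL };

/* Toy model of the folio state the caller inspects after locking. */
struct toy_folio {
	bool dirty;          /* models folio_test_dirty() */
	bool writeback;      /* models folio_test_writeback() */
	const void *mapping; /* models folio->mapping */
};

/*
 * Decide whether a folio returned by the lockless tagged lookup must be
 * skipped. A stale mark can hand back a folio that is no longer dirty or
 * is already under writeback; this check filters those cases out.
 */
static bool must_skip(const struct toy_folio *folio, const void *mapping,
		      enum sync_mode mode)
{
	assert(folio != NULL);
	return !folio->dirty ||
	       (folio->writeback && mode == SYNC_NONE) ||
	       folio->mapping != mapping;
}
```

In this model, a lookup that raced with the mark being cleared returns at worst a folio for which `must_skip` is true, so the stale bitmap value costs one wasted lock and unlock rather than a wrong write.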