
Remove Duplicate Lines — Deduplicate Text & Filter Unique Lines

Last verified May 2026 — runs in your browser


Paste a list, log, dataset, or anything line-separated and the page strips out duplicate lines in real time. The first occurrence of each line is kept (so the original order survives), and toggles for case sensitivity and whitespace trimming let you decide whether " Apple", "apple", and "Apple " all collapse to one entry or stay as three. A stats line shows how many lines you started with, how many unique lines survived, and how many duplicates were removed — useful when cleaning up an exported email list, deduping CSV rows pasted from a spreadsheet, or compressing a logfile to its unique error messages before grepping further.

About this tool

The deduplication is a single-pass O(n) walk over the input lines using a JavaScript `Set` keyed on the comparison form (lowercased and/or trimmed, depending on the toggles). The first occurrence wins, so the order of the remaining lines exactly matches their first appearance in the input — important when the order encodes meaning (timestamped logs, ranked lists, ordered CSVs). Whitespace trimming applies to both the comparison key AND the output line when enabled, so " apple" and "apple " don't just match each other but also export as a clean "apple" in the result. Case-insensitive mode lowercases only the comparison key, so the kept line preserves its original casing — "Apple" is kept and "apple" is dropped, not the other way round. The whole pass runs reactively as you type or paste, so a 10,000-line list dedupes in tens of milliseconds.

Use cases: cleaning up an email export before sending a newsletter, deduping a CSV column pasted from Excel, collapsing a noisy logfile to its unique error messages, building a unique-tags list from a folksonomy, or sanitizing a wordlist before feeding it to another tool.

  • Single-pass O(n) deduplication via JavaScript Set
  • First occurrence wins — original order of remaining lines preserved
  • Optional case-insensitive matching (Apple = apple = APPLE)
  • Optional whitespace trimming (applies to both comparison and output)
  • Live stats: original line count, unique count, removed count
  • Reactive — runs as you type or paste, no Run button needed
  • Handles 10,000+ lines in tens of milliseconds
  • Empty lines count as duplicates (first kept, rest removed)
  • One-click copy of the deduped result to clipboard
  • Useful for email exports, CSV columns, logfile compression, wordlist cleanup
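The pass described above can be sketched in a few lines of JavaScript. This is a minimal sketch, not the tool's actual source; the function and option names are illustrative:

```javascript
// Single-pass dedup: a Set of comparison keys, first occurrence wins.
function dedupeLines(text, { ignoreCase = false, trim = false } = {}) {
  const seen = new Set();
  const out = [];
  for (const rawLine of text.split("\n")) {
    const line = trim ? rawLine.trim() : rawLine; // trimming also cleans the output line
    const key = ignoreCase ? line.toLowerCase() : line; // lowercasing affects only the key
    if (!seen.has(key)) {
      seen.add(key);
      out.push(line); // kept line preserves its original casing
    }
  }
  return {
    result: out.join("\n"),
    original: text.split("\n").length,
    unique: out.length,
  };
}
```

For example, with both toggles on, " Apple", "apple", and "Apple " collapse to the single kept line "Apple" — first occurrence, trimmed, original casing.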

Free. No signup. Your inputs stay in your browser. Ads via Google AdSense (consent required).

Frequently asked questions

Why is the first occurrence kept rather than the last?

The dedup pass walks the input lines once and records each comparison key in a JavaScript Set; a line is emitted only if its key has not been seen before, so the first occurrence wins by construction. (ECMA-262 additionally specifies that Set iteration order equals insertion order, so rebuilding the output from the set itself would yield the same first-appearance order.) This preserves order-as-meaning patterns (timestamped logs, ranked lists, ordered CSVs) where the first row is canonical. If last-occurrence-wins is needed, the inverse is achievable by reversing the input, deduping, and reversing the output, but most use cases — email exports, CSV cleanup, log compression — want first-wins.
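The reverse-dedupe-reverse trick can be sketched as a small helper. This is hypothetical code, not part of the tool:

```javascript
// Last-occurrence-wins dedup: first-wins on the reversed list is
// last-wins on the original, so reverse, dedupe, reverse back.
function dedupeKeepLast(text) {
  const seen = new Set();
  const kept = [];
  for (const line of text.split("\n").reverse()) {
    if (!seen.has(line)) {
      seen.add(line);
      kept.push(line);
    }
  }
  return kept.reverse().join("\n");
}
```

For the input "a", "b", "a", "c" this keeps "b", "a", "c": each surviving line sits where its last occurrence was.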

How does case-insensitive comparison handle Unicode edge cases?

The comparison key is generated with String.prototype.toLowerCase, which applies Unicode default lowercase mapping rather than true case folding. This matches what most users expect ('Apple' = 'apple' = 'APPLE') but diverges from the full folding defined by UCD CaseFolding.txt in a few cases: German ß lowercases to itself, so 'straße' and 'STRASSE' do not match even though full folding maps ß to 'ss', and the Turkish dotted/dotless I pair is the classic locale-dependent case that a locale-unaware lowercase cannot get right. For everyday lists — emails, CSVs, log lines — the simple behavior is correct; for German legal text or Turkish, routing the comparison through Intl.Collator(locale, { sensitivity: 'accent' }) handles those cases instead.
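The Intl.Collator alternative can be sketched as follows; the locale and options here are illustrative, not the tool's actual settings:

```javascript
// Locale-aware, case-insensitive comparison via Intl.Collator.
// sensitivity: "accent" ignores case differences but keeps accents distinct.
const collator = new Intl.Collator("en", { sensitivity: "accent" });

function sameLine(a, b) {
  return collator.compare(a, b) === 0;
}
```

Note that a collator gives a pairwise comparison, not a hashable key, so a Set-based dedup cannot use it directly; a dedup built on it would have to scan the kept representatives linearly or derive a collation key some other way.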

What is the time complexity, and how does it scale?

The algorithm is a single-pass O(n) walk over the input lines. ECMA-262 specifies Set.prototype.has and Set.prototype.add as sublinear; mainstream engines (V8, SpiderMonkey, JavaScriptCore) implement Set on hash tables, where amortized O(1) follows from the standard hash-table analysis (Knuth, TAOCP Vol 3 §6.4 Hashing). The total work for n input lines is O(n) inserts and O(n) lookups. The pipeline scales linearly: a 10,000-line list dedupes in milliseconds, and a 100,000-line list completes in well under a second on typical hardware. Memory grows with the number of unique lines, not the input length — duplicate-heavy inputs (a noisy log with repeated errors) compress to a small unique set.
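The linear-scaling and memory claims can be spot-checked with a short script. This is an informal check on an assumed workload, not a rigorous benchmark:

```javascript
// Dedupe 100,000 lines drawn from 1,000 distinct values and time it.
function dedupe(lines) {
  const seen = new Set();
  const out = [];
  for (const line of lines) {
    if (!seen.has(line)) { seen.add(line); out.push(line); }
  }
  return out;
}

const lines = Array.from({ length: 100_000 }, (_, i) => `error-${i % 1_000}`);
const t0 = Date.now();
const unique = dedupe(lines);
const ms = Date.now() - t0;
// Memory tracks the unique set: 100,000 inputs collapse to 1,000 survivors.
console.log(`${lines.length} -> ${unique.length} lines in ${ms} ms`);
```

On typical hardware this finishes in a handful of milliseconds, and the Set holds only the 1,000 distinct values regardless of how duplicate-heavy the input is.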

Why does whitespace trimming apply to both the comparison key AND the output line?

This is a deliberate design choice: when 'trim whitespace' is enabled, ' apple', 'apple ', and 'apple' are not only treated as equal during comparison, they all export as the clean 'apple' in the result. Trimming for comparison only — keeping the original spacing in the output — is also defensible (some uses care about preserving exact bytes), but for typical use cases (email exports, CSV cleanup, list sanitization) the user wants both: collapse equivalents AND clean the survivors. If comparison-only trimming is required, a variant that trims only the comparison key while emitting the untouched line achieves it.
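A comparison-only-trim variant (not a mode this tool offers) could look like this sketch:

```javascript
// Match lines on their trimmed form, but keep the survivors' original
// leading/trailing whitespace in the output.
function dedupeCompareTrimmed(text) {
  const seen = new Set();
  const out = [];
  for (const line of text.split("\n")) {
    const key = line.trim();  // trimmed form used only for matching
    if (!seen.has(key)) {
      seen.add(key);
      out.push(line);         // original bytes survive
    }
  }
  return out.join("\n");
}
```

Here " apple" and "apple " collapse to one entry, but the kept entry is " apple" with its original leading space intact.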

How does this tool handle accessibility for screen readers?

The output region and the stats line (original count, unique count, removed count) sit inside an aria-live="polite" region, the pattern behind W3C WCAG Success Criterion 4.1.3 (Status Messages, introduced in WCAG 2.1, Recommendation 5 June 2018; carried unchanged into WCAG 2.2, Recommendation 5 October 2023). Polite live regions queue announcements until any speech in progress has finished, which is appropriate for incremental updates as the user types or pastes. Screen readers (NVDA, JAWS, VoiceOver) pick up the live region automatically; the user does not need to do anything else.

Sources (5)
  • The Unicode Consortium (2024). The Unicode Standard, Version 16.0 — UCD CaseFolding.txt (simple vs full case folding). Unicode Consortium, Mountain View, CA (released 10 September 2024).
  • ECMA International (2025). ECMAScript 2025 Language Specification — Set objects (insertion-order iteration) and String.prototype.toLowerCase. ECMA-262, 16th edition, June 2025.
  • ECMA International (2025). ECMAScript 2025 Internationalization API Specification — Intl.Collator (locale-aware case-insensitive option). ECMA-402, 12th edition, June 2025.
  • Knuth, D. E. (1998). The Art of Computer Programming, Vol. 3: Sorting and Searching — §6.4 Hashing. Addison-Wesley, 2nd edition (amortized O(1) hash-table analysis).
  • World Wide Web Consortium (W3C) (2018). Web Content Accessibility Guidelines (WCAG) 2.1 — Success Criterion 4.1.3 Status Messages. W3C Recommendation 5 June 2018; carried unchanged into WCAG 2.2 (Recommendation 5 October 2023).

These are the primary specifications and publications the behavior described above is based on. Locate them by title and year from the publishing body (Unicode Consortium, ECMA International, W3C) or on Google Scholar.