quicksort has large run times and deep function call stacks
on arrays with many duplicates
This is only true for Lomuto type partition scheme, a variation of quicksort that first appears to be mentioned in 1984, versus the original Hoare partition scheme published in 1961. Normally Lomuto partition step results in 3 sub-partitions: elements < pivot, pivot (single element), elements >= pivot. In the case where all elements are equal, then a partition step results in 0 elements < pivot, pivot, n-1 elements >= pivot, recursing on the n-1 element partition, only removing one element per level of recursion, the worst case behavior for quicksort with time complexity O(n^2).
For Hoare partition scheme, as the number of duplicates increases, the partitioning will result in closer to equal sized sub-partitions. In the case where all elements are equal, then Hoare partition scheme results in an ideal 50% / 50% split, because the partitioning loop results in the working left (i in the wiki article) and right (j in the wiki article) indexes meeting at the middle. There will be unneeded swaps of equal elements, but the check to avoid swaps generally increases run time more than just doing the swaps for typical cases. In general, as the percentage of duplicates increases, the greater the probability that the indexes will meet near the middle of a partition.
Doing a 3 way partition as suggested in the original question helps when the pivot is one of the duplicate values. It may take quite a few levels of recursion before a pivot is one of the duplicates in a sub-array, but will eliminate the pivot and it's duplicates from further recursion. This will avoid worst case time complexity O(n^2) due to a large number of equal values, but will be slower than Hoare unless there are significant number of equal values.
https://en.wikipedia.org/wiki/Quicksort#Lomuto_partition_scheme
https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme
https://en.wikipedia.org/wiki/Quicksort#Repeated_elements
Although not asked for in the original question, some comments have been made asking to explain Hoare and Lomuto schemes. Both schemes use two indexes, and are optimized to reduce the number of compares (at the cost of more swaps). These explanations are for the Wiki examples for Hoare and Lomuto linked to above, and I added a Dutch National Flag example based on Lomuto.
Hoare - set pivot = middle element of array, array[floor((lo + hi)/2)]. naive implementation for inner loops: scan left to right for an element > pivot (index i), scan right to left for an element < pivot (index j), swap the elements, continue this process until the two indexes meet or cross. The inner loops need to check to make sure an index is not advanced beyond the bounds of the sub-array being partitioned. Hoare optimized this by changing the compares to >= and <=, which eliminates the need for a check for going beyond the bounds of a sub-array, at the cost of sometimes swapping equal elements. After this, elements of array[lo .. j] are <= pivot and elements of array[j+1 .. hi] are >= pivot.
Lomuto - set pivot = array[hi]. i initialized to lo. j current index to find elements < pivot. Main loop: for j = lo until j == hi increment j: if array[j] < pivot, then: swap(array[i], array[j]) and increment i. After each iteration, elements of array[lo .. i-1] < pivot, array[i .. j-1] >= pivot. Note that i == j until the first instance of array[j] >= pivot, and although swapping when i == j is needless, it's faster than checking for i==j to avoid the swap, since i != j after the first instance of array[j] >= pivot. Once the main loop is done, array[i] is swapped with array[hi] to put the pivot into place. After this, elements of array[lo .. i-1] are < pivot, array[i] == pivot, elements of array[i+1 .. hi] are >= pivot.
Lomuto based Dutch National flag scheme - After Lomuto main partition loop is done, repeat the process using similar logic on array[i+1 .. hi], (using j and k), to end up with array[lo .. i-1] < pivot, array[i .. j] == pivot, array[j+1 .. hi] > pivot.