CakePHP has an awesome isUnique rule. However, I'm often importing CSVs with a lot of columns per row (or data for more than one table), often including an email column, and I want to keep emails unique.
To speed things up I'm saving 50 rows at a time with saveManyOrFail, and it's throwing an error on duplicates - cool.
But would it be better (performance-wise) to first run a SELECT query with the 50 upcoming emails, remove the duplicates from the saveMany batch (and disable the isUnique rule for that save), so the save only contains new, unique entries?
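For reference, here's a rough sketch of what I mean, assuming a `Users` table with an `email` field (table and field names are just placeholders for my real schema):

```php
// Collect the emails from the next 50-row chunk of CSV data.
$emails = array_column($chunk, 'email');

// One SELECT to find which of those emails already exist.
$existing = $this->Users->find()
    ->select(['email'])
    ->where(['email IN' => $emails])
    ->all()
    ->extract('email')
    ->toList();

// Keep only the rows whose email is not already in the table.
$fresh = array_filter($chunk, fn ($row) => !in_array($row['email'], $existing, true));

// Save the filtered rows, skipping application rules (including isUnique).
$entities = $this->Users->newEntities($fresh);
$this->Users->saveMany($entities, ['checkRules' => false]);
```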
I have no evidence either way, but my intuition says:
When there are high duplicate counts, filtering first would pay off handsomely.
For low collision counts you wouldn't see any gains, but you also wouldn't see significant losses.
If there were a lot of 50-row chunks to process, you would start to accumulate the cost of building the email query again and again, plus the cost of running it. But this is balanced by the fact that you could eliminate the rule that watches for duplicates. And you would have the option of preparing a query for reuse rather than depending on the Query class, thus bypassing the query-construction overhead (but locking yourself to your current DB dialect).
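By "preparing a query for reuse" I mean something like dropping down to the connection layer and preparing the statement once, then re-executing it per chunk. A minimal sketch, assuming a `users` table and a fixed chunk size of 50 (raw SQL, so tied to your DB dialect):

```php
use Cake\Datasource\ConnectionManager;

$connection = ConnectionManager::get('default');

// Prepare once: 50 placeholders, one per email in a chunk.
$placeholders = implode(',', array_fill(0, 50, '?'));
$statement = $connection->prepare(
    "SELECT email FROM users WHERE email IN ($placeholders)"
);

// Re-execute for each chunk without rebuilding the query.
foreach ($chunks as $chunk) {
    $statement->execute(array_column($chunk, 'email'));
    $existing = array_column($statement->fetchAll('assoc'), 'email');
    // ...filter $chunk against $existing, then save...
}
```

The trade-off is exactly as stated above: you skip the ORM's query construction on every iteration, but the SQL string is now hand-written for your database.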
Doing the optimization you suggest feels like a good choice to me.