Subset sum problem
I apologize in advance if my question is not appropriate for this forum, as it is primarily a question about algorithms rather than specific implementation in PL/SQL and/or Oracle SQL. (Although those matter too, of course.)
The subset sum problem is a classical, well-known and difficult combinatorial problem: https://en.wikipedia.org/wiki/Subset_sum_problem. It has several more or less equivalent formulations; I am interested specifically in the following one:
We are given a set X of positive numbers (possibly with duplicates; that doesn't make the problem any harder) and a positive number N. We need to find a subset Y of X so that the sum of elements in Y is exactly N, or if no such subset exists, we need to know that. If more than one such subset Y exists, we will be satisfied with any one of them (it doesn't matter which).
Example: X = {3, 8, 22, 40, 17, 32, 4} and N = 47. In this case the subset Y = {3, 4, 40} is good. So is Y = {8, 22, 17}, but we don't care: as soon as we find ONE solution we may declare victory.
One way to approach this problem is to build all the possible subsets of X, compute the sum of each, and compare it to N. This is easy to understand and to code, but if the cardinality of X is C, then there are power(2, C) subsets, and this may get out of hand quickly.
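To make the cost concrete, here is a minimal brute-force sketch in Python (the thread's code is PL/SQL; the names here are invented for illustration only):

```python
from itertools import combinations

def subset_sum_brute(items, target):
    """Try every subset; return ids of one whose values sum to target, else None.

    items is a list of (id, value) pairs; with C elements there are
    power(2, C) subsets to try, which is what gets out of hand."""
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            if sum(v for _, v in combo) == target:
                return [i for i, _ in combo]
    return None

# The example above: X = {3, 8, 22, 40, 17, 32, 4}, N = 47.
vals = [3, 8, 22, 40, 17, 32, 4]
solution = subset_sum_brute(list(enumerate(vals)), 47)
print(solution)  # one valid subset, as positions into vals
```

This declares victory on the first match, exactly as described above, but still visits subsets in exponentially growing numbers when no small solution exists.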
The Wikipedia article I linked to above discusses several other approaches to this problem (in addition to the brute-force approach of simply considering all the subsets). This morning I implemented in PL/SQL the Horowitz-Sahni algorithm described there; I find that it can handle, for example, a set X with 45 elements and find a subset with a given sum N (a specific value which is the sum of a subset of 15 elements) in about 30 seconds.
I am interested in other algorithms and/or implementations of those algorithms in PL/SQL.
In particular, @Stew Ashton, in a different thread, mentioned that he has a method that can find a number that is the sum of a subset of 20 elements out of a set X of 60 elements in 0.002 seconds. I would be very interested in both the algorithm and the implementation of that solution.
My original intent was to post my PL/SQL implementation of the Horowitz-Sahni algorithm in this thread and ask for critique, suggestions for optimization, etc.; however, if much better/faster methods exist, that would be a waste of everybody's time (not mine for developing it: I learned a lot of PL/SQL in the process, but that only benefits me).
Answers

Hi mathguy,
Posting your PL/SQL solution would definitely benefit others on this forum too; please do post it. People already know that an implementation of the Horowitz-Sahni algorithm has certain limitations (as described by you), so it's up to the users to decide whether to use it.
Regards,
Ranagal

Pisinger's algorithm is supposed to be linear time "if set members are positive and bounded by the same constant."

Thank you for your interest.
The algorithm is based on the following idea:
Suppose we write X as a disjoint union of two subsets, A and B. That means: X is the (multiset) union of A and B, and every element of X is either in A or in B but not in both. A bit of care must be taken here with multisets (duplicate values); the way I like to think of it, and how I implemented it, is that elements of X have both a unique identifier and a weight (or a value). We distinguish the elements from each other by their unique id, but different elements are allowed to have the same weight, and the subset sum problem is expressed in terms of weights.
So, continuing with the idea: Suppose X is the disjoint union of A and B. It doesn't matter how we choose A and B, the argument below still applies; in practice, to minimize the number of required computations, it helps if A and B each have half the elements of X (or approximately that, if the cardinality of X is an odd number).
Now suppose we compute the subset sums for all subsets of A and put the sums in an ascending array x(1), x(2), ...; then we do the same thing with all the subsets of B and put those subset sums in an ascending array y(1), y(2), ... Note that if X has, say, 40 elements, then it has power(2, 40) subsets. However, A and B each have 20 elements (if we choose them that way), so we only need to compute 2 * power(2, 20) sums, a much, much smaller number.
Now, if a subset Y of X has subset sum equal to the target number N, then Y is the disjoint union of (Y intersect A) and (Y intersect B). Correspondingly, N must be the sum of some x(i) and some y(j). So here comes the clever idea, which lets us compute far fewer sums, instead of computing ALL the sums x(i) + y(j) as in a Cartesian product. (If we had to do that, the number of operations would be exactly that of the brute-force solution: no savings at all!)
So, what is that clever idea? Let's say the array y(*) has K elements. Set i = 1, j = K. Compute x(1) + y(K). If it's too small (less than N), then increment i by 1. If it's too big (greater than N), then decrement j by 1. Of course, if x(1) + y(K) is exactly equal to N, we are done. Then continue: at each step compare x(i) + y(j) with N; either we are done, or we increase i by 1, or we decrease j by 1. If we run out of i's OR j's at any step, we stop: there is no solution.
In math, this assertion requires proof. As a programmer, you can either accept it on faith, or you may want to see the proof. The proof is not too difficult; it is by mathematical induction, and the inductive step is the same as the initial step. The key is this: suppose x(1) + y(K) < N. Since y(K) is the MAXIMUM element of the y array, x(1) + y(j) will be < N for ALL j, so we know x(1) won't ever help us solve the problem. So we may as well discard it; we now have a shorter array x. Repeat.
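The scan in the two paragraphs above can be sketched in a few lines. This is a hedged Python illustration (the thread's own code is PL/SQL; the function name is invented), assuming the two lists of subset sums are already sorted ascending:

```python
def two_pointer_match(xs, ys, target):
    """Given ascending lists xs and ys, find (i, j) with xs[i] + ys[j] == target.

    If the sum is too small, xs[i] is useless even with the largest remaining y,
    so discard it (i += 1); if too big, ys[j] is useless even with the smallest
    remaining x, so discard it (j -= 1). Total work: O(len(xs) + len(ys))."""
    i, j = 0, len(ys) - 1
    while i < len(xs) and j >= 0:
        s = xs[i] + ys[j]
        if s < target:
            i += 1
        elif s > target:
            j -= 1
        else:
            return i, j
    return None  # ran out of i's or j's: no solution
```

Each step permanently discards one array entry, which is exactly the induction argument above turned into code.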
Now, to make this work, we need to compute the subset sums for A and B AND have the sums arranged in ascending order. This latter part is important: if we collect the subset sums with no regard to ordering and leave the sorting to the end, the sort itself will require too many steps, and we'll be back closer to "no help from this approach".
However, we can deal with that as follows: Suppose we computed the subset sums for elements a_1, ... , a_m of subset A, and the subset sums are already in an ORDERED (ascending) array. Then we add a_(m+1) to each of these sums; we get a new array, and this new array is also ORDERED. Now we can combine the previous array and this new one; ordering it is relatively easy, because we must MERGE two arrays that are ALREADY ORDERED. So somewhere in the implementation we must make sure we take care of this step.
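The merge step just described might look like this in Python (a hedged sketch with invented names, not the PL/SQL below); heapq.merge performs the linear merge of two already-sorted sequences:

```python
from heapq import merge

def sorted_subset_sums(values, cap=None):
    """All subset sums of values, produced in ascending order.

    Invariant: sums stays sorted. Adding the next element v to every entry
    of a sorted list yields another sorted list, so combining the old and
    new lists is a linear MERGE of two ordered arrays, never a full sort.
    The optional cap prunes sums that already exceed the target
    (positive elements mean they can never shrink back)."""
    sums = [0]  # the empty subset
    for v in values:
        shifted = [s + v for s in sums]       # still ascending
        sums = list(merge(sums, shifted))     # linear merge of two sorted lists
        if cap is not None:
            sums = [s for s in sums if s <= cap]
    return sums

print(sorted_subset_sums([3, 8, 4]))  # [0, 3, 4, 7, 8, 11, 12, 15]
```

Calling this on each half of X and feeding the two sorted lists to the two-pointer scan is the whole meet-in-the-middle idea.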
One more thing I did in my implementation was to cut off the sums when they become greater than the target number. This is reflected in the POSITION() method in the code below.
So, here is the implementation. In the end, all I need is the static method SSS_OBJ.SUBSET_SUM(x, N), where x is an array of records (ID, VAL) representing the set X, and N is the target number. The method returns 'Not found' if no subset sum equals N; otherwise it returns a '+'-separated string of the IDs of the elements whose values add up to N. To demonstrate how that works, I hardcoded the necessary array; if the data is in a table, it can be used to populate the array with a BULK COLLECT operation.

Here's the code:
/* drop type sss_obj; drop type sss_arr; drop type sss; */

create or replace type sss as object(
  id  varchar2(4000),
  val number
);
/
create or replace type sss_arr as varray(2000000000) of sss;
/
create or replace type sss_obj as object(
  arr sss_arr,
  static function empty_set return sss_obj,
  member function position(x number) return integer,
  member function add_elem(elem sss, c number default null) return sss_obj
  /* Add a set element to the array of subsets (with subset sum), subject to
     the sum not exceeding the upper bound c. Call with the default NULL for c
     to add the element without restrictions. Call with sss := sss(null, 0)
     to delete elements where val > c. */,
  member function merge(oth sss_obj) return sss_obj,
  static function subsets(x sss_arr, c number default null) return sss_arr,
  static function two_terms(x sss_arr, y sss_arr, n number) return varchar2,
  static function subset_sum(x sss_arr, n number) return varchar2
);
/
create or replace type body sss_obj as
  static function empty_set return sss_obj is
  begin
    return sss_obj(sss_arr(sss(null, 0)));
  end;
  member function position(x number) return integer is
    lb integer := 1;
    ub integer := self.arr.count;
    n  integer;
  begin
    if x >= self.arr(lb).val then
      n := ceil((lb + ub)/2);
      while lb < ub loop
        if x >= self.arr(n).val then
          lb := n;
        else
          ub := n - 1;
        end if;
        n := ceil((lb + ub)/2);
      end loop;
    else
      n := 0;
    end if;
    return n;
  end;
  member function add_elem(elem sss, c number default null) return sss_obj is
    new_arr sss_arr := sss_arr();
  begin
    new_arr.extend(case when c is null then self.arr.count
                        else self.position(c - elem.val) end);
    for i in 1 .. new_arr.count loop
      new_arr(i) := sss(self.arr(i).id || '+' || elem.id, self.arr(i).val + elem.val);
    end loop;
    return sss_obj(new_arr);
  end;
  member function merge(oth sss_obj) return sss_obj is
    b sss_arr := sss_arr();
    i integer := 1;
    j integer := 1;
    n integer := 1;
  begin
    b.extend(self.arr.count + oth.arr.count);
    while i <= self.arr.count and j <= oth.arr.count loop
      if self.arr(i).val <= oth.arr(j).val then
        b(n) := self.arr(i);
        i := i + 1;
      else
        b(n) := oth.arr(j);
        j := j + 1;
      end if;
      n := n + 1;
    end loop;
    for s in i .. self.arr.count loop
      b(s + oth.arr.count) := self.arr(s);
    end loop;
    for t in j .. oth.arr.count loop
      b(t + self.arr.count) := oth.arr(t);
    end loop;
    return sss_obj(b);
  end;
  static function subsets(x sss_arr, c number default null) return sss_arr is
    a sss_obj := sss_obj.empty_set;
  begin
    for i in 1 .. x.count loop
      a := a.merge(a.add_elem(x(i), c));
    end loop;
    return a.arr;
  end;
  static function two_terms(x sss_arr, y sss_arr, n number) return varchar2 is
    str varchar2(4000) := 'Not found';
    i integer := 1;
    j integer := y.count;
  begin
    while i <= x.count and j >= 1 loop
      if x(i).val + y(j).val < n then
        i := i + 1;
      elsif x(i).val + y(j).val > n then
        j := j - 1;
      else
        str := x(i).id || y(j).id;
        exit;
      end if;
    end loop;
    return str;
  end;
  static function subset_sum(x sss_arr, n number) return varchar2 is
    c integer := ceil(x.count/2);
    d integer := x.count - c;
    y sss_arr := sss_arr();
    z sss_arr := sss_arr();
  begin
    y.extend(c);
    z.extend(d);
    for i in 1 .. c loop
      y(i) := x(i);
    end loop;
    for i in 1 .. d loop
      z(i) := x(c + i);
    end loop;
    return sss_obj.two_terms(sss_obj.subsets(y, n), sss_obj.subsets(z, n), n);
  end;
end;
/
EDIT
I added code (highlighted in bold and blue in the original post) to implement fully the optimization that cuts off partial sums when they exceed the target. Initially I had done that in the ADD_ELEM method, but I did not follow through in the SUBSETS method and in the calls to SUBSETS at the end of the SUBSET_SUM static method. These are fixed now.
END EDIT

And here is a short test. I include DBMS_OUTPUT calls to show the input list (of id/weight pairs), and then the last line shows the target value N and the rows that together add up to N (identified by their ids, not by their weights).
declare
  x sss_arr := sss_arr();
  n number := 0;
begin
  x.extend(26);
  for i in 1 .. 26 loop
    x(i) := sss(chr(96+i), power(2,i));
    dbms_output.put_line(x(i).id || ' ' || x(i).val);
    if mod(i, 3) = 0 then
      n := n + power(2, i);
    end if;
  end loop;
  dbms_output.put_line(n || ' ' || sss_obj.subset_sum(x,n));
end;
/

a 2
b 4
c 8
d 16
e 32
f 64
g 128
h 256
i 512
j 1024
k 2048
l 4096
m 8192
n 16384
o 32768
p 65536
q 131072
r 262144
s 524288
t 1048576
u 2097152
v 4194304
w 8388608
x 16777216
y 33554432
z 67108864
19173960 +c+f+i+l+o+r+u+x

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.237

Hmmm... I didn't state the problem very carefully, and this leads to possible confusion.
In combinatorics (pure math) the problem is stated for INTEGERS, not for arbitrary numbers. I formulated the problem for "numbers" in general. And indeed, for the brute-force algorithm, as well as for the Horowitz-Sahni algorithm, it makes no difference.
However, some algorithms take advantage specifically of the fact that the numbers are integers. And that is the case with Pisinger's algorithm (and others). In the earlier thread on this forum, the input numbers had two decimal places of precision. Now, in accounting (and in computing) requiring that numbers be integers is not a constraint at all. In accounting, amounts are integer multiples of one cent (or one penny etc.). In computing, they are integer multiples of power(2, 32) or power(2, 64) or whatever, depending on how numbers are implemented. (Perhaps power(10, 38) when using library functions to implement decimal numbers with high precision.)
But this means that a known upper bound of C = 3 (say) is really an upper bound of C = 3 * power(2, 32) (or whatever; C = 300 if the upper bound is "three dollars" and the integer values are "in cents"). And the complexity of Pisinger's algorithm is O(nC), where n is the cardinality of the set X and C is the upper bound for the INTEGERS considered in the problem; if C is 3 * power(2, 32), that may not be much help.
In any case, this is an important clarification: for any algorithms that take advantage specifically of the fact that the numbers are INTEGERS, we should accept that the integers are very large indeed.
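To see concretely why integer magnitude matters for this family of algorithms, here is a sketch of the classic pseudo-polynomial dynamic program (not Pisinger's refinement itself, just the kind of table it improves on; Python, with invented names):

```python
def subset_sum_dp(values, target):
    """Classic pseudo-polynomial subset-sum DP for positive integers.

    reachable maps each attainable sum <= target to one list of indices
    achieving it. Work and memory are O(n * target): fine when the target
    is 300 "cents", hopeless when it is on the order of 3 * 2**32."""
    reachable = {0: []}
    for idx, v in enumerate(values):
        # snapshot the current sums so each element is used at most once
        for s, picks in list(reachable.items()):
            t = s + v
            if t <= target and t not in reachable:
                reachable[t] = picks + [idx]
    return reachable.get(target)
```

The number of states is bounded by the target value, which is exactly the point of the clarification above: scale the "integers" up by power(2, 32) and the state space scales with them.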

My attempt is really a brute-force approach, with various optimisations.
I take my multiset of X numbers from a table TAB2. I'll demonstrate with eight numbers that are all powers of 2:
drop table tab2 purge;

create table tab2(
  item number generated always as identity primary key,
  distributed_number number not null
);

insert into tab2(distributed_number)
select power(2,level) from dual
connect by level <= 8;
I start by populating a PL/SQL collection from the following query:
select item, distributed_number num,
  min(distributed_number) over() min_num,
  sum(distributed_number) over(order by distributed_number rows unbounded preceding) remaining_sum
from tab2
where distributed_number <= p_target_sum
order by 2 desc, 4 desc;
ITEM  NUM  MIN_NUM  REMAINING_SUM
   8  256        2            510
   7  128        2            254
   6   64        2            126
   5   32        2             62
   4   16        2             30
   3    8        2             14
   2    4        2              6
   1    2        2              2

The first optimization is to omit any number that is already too big for the target.
The processing is done by a RECURSE() procedure that takes as input the current index in the collection, and the sum found so far (in P_CUMUL_SUM).
- if the incoming p_cumul_sum >= the target sum, then there is a logic error in my code and I abort.
- MIN_NUM is the smallest number in X (the input multiset). If p_cumul_sum + min_num > the target sum, then this path will never find a solution, so return.
- if p_cumul_sum + min_num = the target sum, then we have a solution.
- REMAINING_SUM is the sum of all remaining values in the collection, including the current location. If p_cumul_sum + remaining_sum = the target, then we have a solution.
- if p_cumul_sum + remaining_sum < the target, then this path will never find a solution, so return.
- find the following num where p_cumul_sum + that number <= the target.
- recurse with that number added to p_cumul_sum.
- if no answer is found, recurse without that number added.
- Additional optimisation: omit the "recurse without" step if the next element in the collection has the same number as this element.
NOTE: there was a bug in this optimisation which caused me to skip some recursions and generate false negatives.
Basically, this remains a "brute force" algorithm, except that it can sometimes find a solution early and can often avoid unnecessary recursions. In almost all my tests, when an answer is found it is found quickly. When the algorithm runs a long time without completing, that often (but not always) means that there is no solution.
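For readers who prefer a compact restatement, the pruning rules above might be paraphrased like this in Python (invented names; a hedged sketch, not the actual PL/SQL function posted below):

```python
def pruned_search(values, target):
    """Depth-first search with the pruning rules listed above.

    Works on the values sorted descending; suffix[i] plays the role of
    REMAINING_SUM (sum of everything from position i on) and min_v the
    role of MIN_NUM. Returns positions into the sorted list, or None."""
    vs = sorted(values, reverse=True)
    n = len(vs)
    suffix = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + vs[i]
    min_v = vs[-1]

    def recurse(i, cumul, picks):
        if cumul == target:
            return picks
        # prune: smallest element overshoots, or whole remaining tail undershoots
        if i == n or cumul + min_v > target or cumul + suffix[i] < target:
            return None
        if cumul + suffix[i] == target:
            return picks + list(range(i, n))  # take the whole remaining tail
        # recurse WITH vs[i], if it still fits under the target
        if cumul + vs[i] <= target:
            found = recurse(i + 1, cumul + vs[i], picks + [i])
            if found is not None:
                return found
        # recurse WITHOUT vs[i]; skipping one duplicate is the same as
        # skipping them all, so jump past equal values
        j = i
        while j + 1 < n and vs[j + 1] == vs[j]:
            j += 1
        return recurse(j + 1, cumul, picks)

    return recurse(0, 0, [])
```

The worst case is still exponential, as noted; the pruning only shortcuts branches whose bounds already decide the outcome.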
Here is the function. The target sum is an input parameter, but the multiset X is queried from table TAB2.
UPDATE 2018-01-16: corrected bug around lines 95-98, see NOTE above.
create or replace function subset_sum(p_target_sum in number) return number is
  cursor cur_tab2(p_target_sum number) is
    select item, distributed_number num,
      min(distributed_number) over() min_num,
      sum(distributed_number) over(order by distributed_number rows unbounded preceding) remaining_sum
    from tab2
    where distributed_number <= p_target_sum
    order by 2 desc, 4 desc;
  type tt_tab2 is table of cur_tab2%rowtype;
  gt_tab2 tt_tab2;
  type tt_indexes is table of number;
  gt_indexes tt_indexes := new tt_indexes();
  g_cumul_sum number := 0;
  g_num_calls number := 0;
/*
  if incoming p_cumul_sum >= p_target_sum then internal error
  if p_cumul_sum + min_num > p_target_sum then return
  if p_cumul_sum + min_num = p_target_sum then SOLVED
  if p_cumul_sum + remaining_sum = p_target_sum then SOLVED
  if p_cumul_sum + remaining_sum < p_target_sum then return
  find following num where p_cumul_sum + it <= p_target_sum
*/
  procedure recurse(p_row_done in number, p_cumul_sum in out number) is
    l_row_do number := p_row_done + 1;
    l_cumul_sum number;
  begin
    -- limit total number of recursions
    g_num_calls := g_num_calls + 1;
    if g_num_calls >= 1000000 then
      dbms_output.put_line('Aborted after 1000000 recursions, p_row_done = ' || p_row_done);
      return;
    end if;
    dbms_output.put_line(p_row_done || ':' || p_cumul_sum);
    -- if incoming p_cumul_sum >= p_target_sum then internal error
    if p_cumul_sum >= p_target_sum then
      dbms_output.put_line(
        'Internal error: p_cumul_sum (' || p_cumul_sum || ') >= p_target_sum (' || p_target_sum || ')'
      );
      return;
    end if;
    -- if p_cumul_sum + min_num > p_target_sum then return
    if l_row_do > gt_tab2.count
    or p_cumul_sum + gt_tab2(l_row_do).min_num > p_target_sum then
      return;
    end if;
    -- if p_cumul_sum + min_num = p_target_sum then SOLVED
    if p_cumul_sum + gt_tab2(l_row_do).min_num = p_target_sum then
      gt_indexes.extend;
      gt_indexes(gt_indexes.last) := gt_tab2.last;
      p_cumul_sum := p_target_sum;
      return;
    end if;
    -- if p_cumul_sum + remaining_sum = p_target_sum then SOLVED
    if p_cumul_sum + gt_tab2(l_row_do).remaining_sum = p_target_sum then
      p_cumul_sum := p_target_sum;
      for i in l_row_do..gt_tab2.last loop
        gt_indexes.extend;
        gt_indexes(gt_indexes.last) := i;
      end loop;
      return;
    end if;
    -- if p_cumul_sum + remaining_sum < p_target_sum then return
    if p_cumul_sum + gt_tab2(l_row_do).remaining_sum < p_target_sum then
      return;
    end if;
    -- find following num where p_cumul_sum + it <= p_target_sum
    while p_cumul_sum + gt_tab2(l_row_do).num > p_target_sum loop
      l_row_do := l_row_do + 1;
    end loop;
    p_cumul_sum := p_cumul_sum + gt_tab2(l_row_do).num;
    gt_indexes.extend;
    gt_indexes(gt_indexes.last) := l_row_do;
    if p_cumul_sum = p_target_sum then
      return;
    end if;
    -- recurse with following num, then without it
    recurse(l_row_do, p_cumul_sum);
    if p_cumul_sum != p_target_sum then
      gt_indexes.trim;
      p_cumul_sum := p_cumul_sum - gt_tab2(l_row_do).num;
      -- Skip following rows that have same num
      declare
        l_check_num number := gt_tab2(l_row_do).num;
      begin
        while l_check_num = gt_tab2(l_row_do+1).num and l_row_do < gt_tab2.count loop
          l_row_do := l_row_do + 1;
        end loop;
        recurse(l_row_do, p_cumul_sum);
      end;
    end if;
  end recurse;
begin
  open cur_tab2(p_target_sum);
  fetch cur_tab2 bulk collect into gt_tab2;
  close cur_tab2;
  if gt_tab2.count = 0 then
    raise_application_error(-20000, 'TAB2 table has no rows');
  end if;
  recurse(0, g_cumul_sum);
  dbms_output.put_line(g_cumul_sum || ':' || p_target_sum);
  dbms_output.put_line('Total number of recursions=' || g_num_calls);
  if g_cumul_sum = p_target_sum then
    for i in 1..gt_indexes.count loop
      dbms_output.put_line(gt_indexes(i) || ':' || gt_tab2(gt_indexes(i)).num);
    end loop;
    null;
  end if;
  return g_num_calls;
end;
/
Best regards,
Stew Ashton

Thank you for posting this. I haven't read the code yet, but I started playing with the function. One change I made immediately was to make the G_NUM_CALLS cutoff into an input to the function, so I can see what happens when a solution is not found very quickly.
It is easy to construct examples with relatively small samples where the algorithm has to work very hard (to find a subset sum, or to conclude that there is no match). I started with a table of 50 numbers between 1000 and 1000.01, generated randomly (and rounded to six decimal places). Then I take the sum of 8, 10, 12 etc. of those 50 numbers, by selecting SUM(DISTRIBUTED_NUMBER) for the rows whose ITEM is in a chosen list of values. If the list is such that many of the largest values are included in the sum, the function returns quickly. However, it takes a minute to return the answer when the target is the sum of the ten smallest numbers out of the 50.
Question for you (perhaps I could answer it by reading the code, too): I tried the sum of the 12 lowest numbers, and I set the cutoff at 500 million recursions. The program abandoned after about two minutes, without the usual list of "Aborted after .... recursions" rows; instead, here is all I got:
SUBSET_SUM(12000.128539,500000000)
169344587

0:12000.128539
Total number of recursions=169344587
Any thoughts as to why this happened? I tried it twice and I got EXACTLY the same answer. Are there any recursive calls in your function? I know you say "recurse" but is there a function somewhere that really calls itself recursively? Could there be an issue with the depth of a call stack somewhere?
For what it's worth, I don't think any implementation of a brute force approach, no matter how much it is optimized, will be efficient enough in general. We may be able to find quickly those targets that are the sum of a small number of summands, but not much more. In a sample like mine, where 50 summands are all bunched between 1000 and 1000.01, I don't see any way that a NEGATIVE answer (the target is not a subset sum for the given summands) can ever be obtained quickly.
Also regarding the optimizations: my implementation of the Horowitz-Sahni algorithm is quite rudimentary, but I did use all those optimizations, obviously. It may not be obvious where or how; it's in the member function ADD_ELEM, where I cut off any sum that is greater than the target. This includes, as a trivial special case, the summands that are themselves greater than the target.
Cheers,
mathguy

mathguy wrote:
...Question for you (perhaps I could answer it by reading the code, too): I tried the sum of the 12 lowest numbers, and I set the cutoff at 500 million recursions. The program abandoned after about two minutes, without the usual list of "Aborted after .... recursions" rows; instead, here is all I got:
SUBSET_SUM(12000.128539,500000000)
169344587
0:12000.128539
Total number of recursions=169344587
Any thoughts as to why this happened? I tried it twice and I got EXACTLY the same answer. Are there any recursive calls in your function? I know you say "recurse" but is there a function somewhere that really calls itself recursively? Could there be an issue with the depth of a call stack somewhere?
I really have no clue.
There is a procedure called recurse() starting in line 26 of the code I posted.
Regards,
Stew

I tried something else: I called the function with the target 12000 (a round number), which I know cannot be achieved, since all the numbers are strictly between 1000 and 1000.01. And I got almost the same answer:
SUBSET_SUM(12000,500000000)
169344629

0:12000
Total number of recursions=169344629
It seems more likely that this is the behavior when the function finds conclusively that the target is NOT a subset sum for the given inputs. It is interesting that the number of iterations is very similar, yet not identical; that must be due to some sums being eliminated sooner with one target vs. the other. (Although I would have expected the number of recursions to be lower for the smaller target; in fact it is the opposite.)
But if that is so, then the answer I got earlier is incorrect. I called the function specifically with the result of summing 12 different rows from TAB2; so if the program finds conclusively that the target is NOT a subset sum for the given inputs, something is wrong.
I will take a closer look at the code, to see if there is any reason it may give out false negatives.
Best,
mathguy