Forum Stats

  • 3,784,147 Users
  • 2,254,897 Discussions
  • 7,880,711 Comments

Discussions

Subset sum problem

mathguy
mathguy Member Posts: 10,229 Blue Diamond
edited Feb 17, 2019 12:18PM in SQL & PL/SQL

I apologize in advance if my question is not appropriate for this forum, as it is primarily a question about algorithms rather than specific implementation in PL/SQL and/or Oracle SQL. (Although those matter too, of course.)

The subset sum problem is a classical, well-known and difficult combinatorial problem. https://en.wikipedia.org/wiki/Subset_sum_problem   It has several more-or-less equivalent formulations; I am interested specifically in the following one:

We are given a set X of positive numbers (possibly with duplicates - that doesn't make the problem any harder) and a positive number N. We need to find a subset Y of X so that the sum of elements in Y is exactly N, or if no such subset exists, we need to know that information. If more than one such subset Y exists, we will be satisfied with one of them (it doesn't matter which one).

Example:     X = {3, 8, 22, 40, 17, 32, 4}  and    N = 47.    In this case the subset Y = {3, 4, 40} is good. So is Y = {8, 22, 17} but we don't care - as soon as we found ONE solution we may declare victory.

One way to approach this problem is to build all the possible subsets of X, compute the sum, compare to N. This is easy to understand and to code, but if the cardinality of X is C, then there are power(2, C) subsets, and this may get out of hand quickly.

The Wikipedia article I linked to above discusses several other approaches to this problem (in addition to the brute force approach of simply considering all the subsets). This morning I implemented in PL/SQL the Horowitz-Sahni algorithm described there; I find that it can handle, for example, a set X with 45 elements and find a subset with a given sum N (a specific value which is the sum of a subset of 15 elements) in about 30 seconds.

I am interested in other algorithms and/or implementations of those algorithms in PL/SQL.

In particular, @Stew Ashton, in a different thread ( ) mentioned that he has a method that can find a number that is the sum of a subset of 20 elements out of a set X of 60 elements in 0.002 seconds. I would be very interested both in the algorithm and the implementation for that solution.

My original intent was to post my PL/SQL implementation of the Horowitz-Sahni algorithm in this thread, and ask for critique/suggestions for optimization etc.; however, if much better/faster methods exist, that would be a waste of everybody's time (not mine for developing it - I learned a lot of PL/SQL in the process - but that only benefits me).

Dejan T.
«134

Answers

  • Ranagal
    Ranagal Member Posts: 635 Bronze Badge
    edited Jan 14, 2019 6:36PM

    Hi mathguy,

    Posting your PL/SQL solution would definitely benefit others too on this forum. Kindly request you to post the same. As people already know that implementation of Horowitz-Sahni algorithm has certain limitations (as descried by you),  it's up to the users to take it.

    Regards,

    Ranagal

  • Paulzip
    Paulzip Member Posts: 8,545 Blue Diamond
    edited Jan 14, 2019 7:18PM

    Pisinger's algorithm is supposed to be linear time "if set members are positive and bounded by the same constant."

    mathguy
  • mathguy
    mathguy Member Posts: 10,229 Blue Diamond
    edited Jan 14, 2019 7:20PM

    Thank you for your interest.

    The algorithm is based on the following idea:

    Suppose we write X as a disjoint union of two subsets, A and B. That means: X is the (multiset) union of A and B; and every element of X is either in A or B but not in both. A bit of care must be taken here with multisets (duplicate values); the way I like to think of it - and that is how I implemented - is that elements of X have both a unique identified and a weight (or a value). We distinguish the elements from each other by their unique id, but different elements are allowed to have the same weight, and the subset sum problem is expressed in terms of weights.

    So, continuing with the idea: Suppose X is the disjoint union of A and B. It doesn't matter how we choose A and B, the argument below still applies; in practice, to minimize the number of required computations, it helps if A and B each have half the elements of X (or approximately that, if the cardinality of X is an odd number).

    Now suppose we compute the subset sums for all subsets of A, and put the sums in an ascending array x(1), x(1), ...  and then we do the same thing with all the subsets of B and put the subset sums in an ascending array y(1), y(2), ...  Note that if X has, say, 40 elements, then it has power(2, 40) subsets. However, A and B each has 20 elements (if we choose them that way), so we only need to compute 2 * power(2, 20) sums - a much, much lower number.

    Now, if a subset Y of X has the subset sum equal to the target number N, then Y is the disjoint union of (Y intersect A) and (Y intersect B). Correspondingly, N must be the sum of an x(i) and a y(j). So here comes the clever idea, which results in computing few sums - instead of computing ALL the sums of an x(i) and a y(j), as in a cartesian product. (If we had to do that, then the number of operations would be exactly that of the brute force solution - no savings from that!)

    So, what is that clever idea? Let's say the array y(*) has K elements.  Set i = 1, j = K.  Compute x(1) + y(K). If it's too small (less than N), then increment i by 1. If it's too big (greater than N), then decrement j by 1. Of course, if x(1) + y(K) is exactly equal to N, we are done. Then continue - at each step compare x(i) + y(j) with N; either we are done, or increase i by 1, or decrease j by 1. If we run out of i's OR j's at any step, we stop: there is no solution.

    In math, this assertion requires proof. As a programmer, you can either accept that on faith, or you may want to see the proof. The proof is not too difficult; it is by mathematical induction, and the inductive step is the same as the initial step. The key is this: Suppose x(1) + y(K) < N. Since y(K) is the MAXIMUM element of the y array, x(1) + y(j) will be < N for ALL j; so we know x(1) won't ever help us solve the problem. So we may as well discard it - we now have a shorter array x. Repeat.

    Now, to make this work, we need to compute the subset sums for A and B AND have the sums arranged in ascending order. This latter part is important: if we collect the subset sums with no regard to ordering, and leave the ordering for the end, the ordering itself will require too many steps - we'll get back closer to "no help from this approach".

    However, we can deal with that as follows: Suppose we computed the subset sums for elements a_1, ... , a_m of subset A, and the subset sums are already in an ORDERED (ascending) array. Then we add a_(m+1) to each of these sums; we get a new array, and this new array is also ORDERED. Now we can combine the previous array and this new one; ordering it is relatively easy, because we must MERGE two arrays that are ALREADY ORDERED. So somewhere in the implementation we must make sure we take care of this step.

    One more thing I did in my implementation was to cut off the sums when they become greater than the target number. This is reflected in the POSITION() method in the code below.

    So, here is the implementation. In the end, all I need is the static method SSS_OBJ.SUBSET_SUM(x, N) where x is an array of records (ID, VAL), representing the set X, and N is the target number. The method returns 'Not Found' if no subset sum equals N, or else it returns a + separated string of ID's of the elements whose values add up to N. To demonstrate how that works, I hard-coded the necessary array; if the data is in a table, it can be used to populate the array with a BULK COLLECT operation.

  • mathguy
    mathguy Member Posts: 10,229 Blue Diamond
    edited Jan 16, 2019 11:27AM

    Here's the code:

    /*    drop type sss_obj;    drop type sss_arr;    drop type sss;    /--*/create or replace type sss as object( id  varchar2(4000), val number);/create or replace type sss_arr as varray(2000000000) of sss;/create or replace type sss_obj as object( arr sss_arr, static function empty_set return sss_obj, member function position(x number) return integer, member function add_elem(elem sss, c number default null) return sss_obj    /*        Add a set element to the array of subsets (with subset sum), subject to        the sum not exceeding the upper bound c.        Call with the default NULL for c to add the element without restrictions.        Call with sss := sss(null, 0) to delete elements where val > c.    */, member function merge(oth sss_obj) return sss_obj, static function subsets(x sss_arr, c number default null) return sss_arr, static function two_terms(x sss_arr, y sss_arr, n number) return varchar2, static function subset_sum(x sss_arr, n number) return varchar2);/create or replace type body sss_obj as  static function empty_set return sss_obj is  begin    return sss_obj(sss_arr(sss(null, 0)));  end;  member function position(x number) return integer  is    lb integer := 1;    ub integer := self.arr.count;    n  integer;  begin    if x >= self.arr(lb).val then      n := ceil((lb + ub)/2);      while lb < ub loop        if x >= self.arr(n).val then          lb := n;        else          ub := n - 1;        end if;        n := ceil((lb + ub)/2);      end loop;    else      n := 0;    end if;    return n;  end;  member function add_elem(elem sss, c number default null) return sss_obj is    new_arr sss_arr := sss_arr();  begin    new_arr.extend(case when c is null then self.arr.count else self.position(c-elem.val) end);    for i in 1 .. new_arr.count loop      new_arr(i) := sss(self.arr(i).id || '+' || elem.id, self.arr(i).val + elem.val);    end loop;    return sss_obj(new_arr);  end;  member function merge(oth sss_obj) return sss_obj is    b sss_arr := sss_arr();    i integer := 1;    j integer := 1;    n integer := 1;  begin    b.extend(self.arr.count + oth.arr.count);    while i <= self.arr.count and j <= oth.arr.count loop      if self.arr(i).val <= oth.arr(j).val then        b(n) := self.arr(i);        i := i + 1;      else        b(n) := oth.arr(j);        j := j+1;      end if;      n := n + 1;    end loop;    for s in i .. self.arr.count loop      b(s + oth.arr.count) := self.arr(s);    end loop;    for t in j .. oth.arr.count loop      b(t + self.arr.count) := oth.arr(t);    end loop;    return sss_obj(b);  end;  static function subsets(x sss_arr, c number default null) return sss_arr is    a sss_obj := sss_obj.empty_set;  begin    for i in 1 .. x.count loop      a := a.merge(a.add_elem(x(i), c));    end loop;    return a.arr;  end;  static function two_terms(x sss_arr, y sss_arr, n number) return varchar2 is    str varchar2(4000) := 'Not found';    i integer := 1;    j integer := y.count;  begin    while i <= x.count and j >= 1 loop      if    x(i).val + y(j).val < n then        i := i + 1;      elsif x(i).val + y(j).val > n then        j := j - 1;      else        str := x(i).id || y(j).id;        exit;      end if;    end loop;    return str;  end;  static function subset_sum(x sss_arr, n number) return varchar2 is    c integer := ceil(x.count/2);    d integer := x.count - c;    y sss_arr := sss_arr();    z sss_arr := sss_arr();  begin    y.extend(c);    z.extend(d);    for i in 1 .. c loop      y(i) := x(i);    end loop;    for i in 1 .. d loop      z(i) := x(c + i);    end loop;    return sss_obj.two_terms(sss_obj.subsets(y, n), sss_obj.subsets(z, n), n);  end;end;/

    EDIT

    I added the code highlighted in bold and blue, to implement fully the optimization that cuts off partial sums when they exceed the target. Initially I had done that in the ADD_ELEM method, but I did not follow through in the SUBSETS method and in the calls to SUBSETS at the end of the SUBSET_SUM static method. These are fixed now.

    END EDIT

  • mathguy
    mathguy Member Posts: 10,229 Blue Diamond
    edited Jan 14, 2019 7:29PM

    And here is a short test. I include DBMS_OUTPUT calls to show the input list (of id - weight pairs), and then the last line shows the target value N and the rows that together add up to N (identified by their id's, not by their weights).

    declare  x sss_arr := sss_arr();  n number := 0;begin  x.extend(26);  for i in 1 .. 26 loop    x(i) := sss(chr(96+i), power(2,i));    dbms_output.put_line(x(i).id || ' '|| x(i).val);    if mod(i, 3) = 0 then      n := n + power(2, i);    end if;  end loop;  dbms_output.put_line(n || ' ' || sss_obj.subset_sum(x,n));end;/a 2b 4c 8d 16e 32f 64g 128h 256i 512j 1024k 2048l 4096m 8192n 16384o 32768p 65536q 131072r 262144s 524288t 1048576u 2097152v 4194304w 8388608x 16777216y 33554432z 6710886419173960 +c+f+i+l+o+r+u+xPL/SQL procedure successfully completed.Elapsed: 00:00:00.237
  • mathguy
    mathguy Member Posts: 10,229 Blue Diamond
    edited Jan 14, 2019 7:45PM

    Hmmm... I didn't state the problem very carefully, and this leads to possible confusion.

    In combinatorics (pure math) the problem is stated for INTEGERS, not for arbitrary numbers. I formulated the problem for "numbers" in general. And indeed, for the brute force algorithm, as well as for the Horowitz-Sahni algorithm, it makes no difference.

    However, some algorithms take advantage specifically of the fact that the numbers are integers. And that is the case with Pisinger's algorithm (and others). In the earlier thread on this forum, the input numbers had two decimal places of precision. Now, in accounting (and in computing) requiring that numbers be integers is not a constraint at all. In accounting, amounts are integer multiples of one cent (or one penny etc.). In computing, they are integer multiples of power(2, -32) or power(2, -64) or whatever, depending on how numbers are implemented. (Perhaps power(10, -38) when using library functions to implement decimal numbers with high precision.)

    But this means that a known upper bound of C = 3 (say) is really an upper bound of C = 3 * power(2, 32) (or whatever; C = 300 if the upper bound is "three dollars" and the integer values are "in cents". And the complexity of Pisinger's algorithm is O(nC) where n is the cardinality of the set X and C is the upper bound for the INTEGERS considered in the problem; if C is 3 * power(2, 32), that may not be much help.

    In any case, this is an important clarification: for any algorithms that take advantage specifically of the fact that the numbers are INTEGERS, we should accept that the integers are very large indeed.

  • Stew Ashton
    Stew Ashton Member Posts: 2,864 Bronze Crown
    edited Jan 16, 2019 4:48AM

    My attempt is really a brute-force approach, with various optimisations.

    I take my multiset of X numbers from a table TAB2. I'll demonstrate with eight numbers that are all powers of 2:

    drop table tab2 purge;create table tab2(  item number generated always as identity primary key,  distributed_number number not null);insert into tab2(distributed_number)select power(2,level) from dualconnect by level <= 8;

    I start by populating a PL/SQL collection from the following query:

    select item,   distributed_number num,  min(distributed_number) over() min_num,  sum(distributed_number)     over(order by distributed_number rows unbounded preceding) remaining_sumfrom tab2--where distributed_number <= p_target_sumorder by 2 desc, 4 desc;

    ITEMNUMMIN_NUMREMAINING_SUM
    82562510
    71282254
    6642126
    532262
    416230
    38214
    2426
    1222

    The first optimization is to omit any number that is already too big for the target.

    The processing is done by a RECURSE() procedure that takes as input the current index in the collection, and the sum found so far (in P_CUMUL_SUM).

    • if the incoming p_cumul_sum >= the target sum then there is a logic error in my code and I abort.
    • MIN_NUM is the smallest number in X (the input multiset). If p_cumul_sum + min_num > the target sum, then this path will never find a solution, so return.
    • if p_cumul_sum + min_sum = the target sum then we have a solution.
    • REMAINING_SUM is the sum of all remaining values in the collection, including the current location. if p_cumul_sum + remaining_sum = the target then we have a solution.
    • if p_cumul_sum + remaining_sum < the target then this path will never find a solution, so return.
    • find following num where p_cumul_sum + that number <= the target.
      • recurse with that number added to p_cumul_sum
      • if no answer is found, recurse without that number added.
        • Additional optimisation: omit the "recurse without" step if the next element in the collection has the same number as this element.
          NOTE: there was a bug in this optimisation which caused me to skip some recursions and generate false negatives.

    Basically, this remains a "brute force" algorithm, except that it can sometimes find a solution early and can often avoid unnecessary recursions. In almost all my tests, when an answer is found it is found quickly. When the algorithm runs a long time without completing, that often (but not always) means that there is no solution.

    Here is the function. The target sum is an input parameter, but the multiset X is queried from table TAB2.

    UPDATE 2018-01-16: corrected bug around lines 95-98, see NOTE above.

    create or replace function subset_sum(p_target_sum in number) return number is  cursor cur_tab2(p_target_sum number) is     select item,       distributed_number num,      min(distributed_number) over() min_num,      sum(distributed_number)         over(order by distributed_number rows unbounded preceding) remaining_sum    from tab2    where distributed_number <= p_target_sum    order by 2 desc, 4 desc;  type tt_tab2 is table of cur_tab2%rowtype;  gt_tab2 tt_tab2;  type tt_indexes is table of number;  gt_indexes tt_indexes := new tt_indexes();  g_cumul_sum number := 0;  g_num_calls number := 0;/*- if incoming p_cumul_sum >= p_target_sum then internal error- if p_cumul_sum + min_num > p_target_sum then return- if p_cumul_sum + min_sum = p_target_sum then SOLVED- if p_cumul_sum + remaining_sum = p_target_sum then SOLVED- if p_cumul_sum + remaining_sum < p_target_sum then return- find following num where p_cumul_sum + it <= p_target_sum*/  procedure recurse(p_row_done in number, p_cumul_sum in out number) is    l_row_do number := p_row_done + 1;    l_cumul_sum number;  begin    -- limit total number of recursions    g_num_calls := g_num_calls + 1;    if g_num_calls >= 1000000 then      dbms_output.put_line('Aborted after 1000000 recursions, p_row_done ='||p_row_done);      return;    end if;    --dbms_output.put_line(p_row_done||':'||p_cumul_sum);      -- if incoming p_cumul_sum >= p_target_sum then internal error        if p_cumul_sum >= p_target_sum then      dbms_output.put_line(        'Internal error: p_cumul_sum ('||p_cumul_sum||') >= p_target_sum ('||p_target_sum||')'      );      return;    end if;  -- if p_cumul_sum + min_num > p_target_sum then return    if l_row_do > gt_tab2.count       or p_cumul_sum + gt_tab2(l_row_do).min_num > p_target_sum      then return;    end if;    -- if p_cumul_sum + min_sum = p_target_sum then SOLVED    if p_cumul_sum + gt_tab2(l_row_do).min_num = p_target_sum then      gt_indexes.extend;      gt_indexes(gt_indexes.last) := gt_tab2.last;      p_cumul_sum := p_target_sum;      return;    end if;  -- if p_cumul_sum + remaining_sum = p_target_sum then SOLVED    if p_cumul_sum + gt_tab2(l_row_do).remaining_sum = p_target_sum then      p_cumul_sum := p_target_sum;      for i in l_row_do..gt_tab2.last loop        gt_indexes.extend;        gt_indexes(gt_indexes.last) := i;      end loop;      return;    end if;  -- if p_cumul_sum + remaining_sum < p_target_sum then return    if p_cumul_sum + gt_tab2(l_row_do).remaining_sum < p_target_sum then      return;    end if;  -- find following num where p_cumul_sum + it <= p_target_sum    while p_cumul_sum + gt_tab2(l_row_do).num > p_target_sum loop      l_row_do := l_row_do + 1;    end loop;    p_cumul_sum := p_cumul_sum + gt_tab2(l_row_do).num;    gt_indexes.extend;    gt_indexes(gt_indexes.last) := l_row_do;    if p_cumul_sum = p_target_sum then      return;    end if;        -- recurse with following num, then without it    recurse(l_row_do, p_cumul_sum);    if p_cumul_sum != p_target_sum then      gt_indexes.trim;      p_cumul_sum := p_cumul_sum - gt_tab2(l_row_do).num;      -- Skip following rows that have same num      declare        l_check_num number := gt_tab2(l_row_do).num;      begin        while l_check_num = gt_tab2(l_row_do+1).num           and l_row_do < gt_tab2.count loop          l_row_do := l_row_do + 1;        end loop;        recurse(l_row_do, p_cumul_sum);      end;    end if;  end recurse;begin  open cur_tab2(p_target_sum);  fetch cur_tab2 bulk collect into gt_tab2;  close cur_tab2;  if gt_tab2.count = 0 then    raise_application_error(-20000, 'TAB2 table has no rows');  end if;  recurse(0,g_cumul_sum);  dbms_output.put_line(g_cumul_sum ||':'||p_target_sum);  dbms_output.put_line('Total number of recursions='||g_num_calls);  if g_cumul_sum = p_target_sum then    for i in 1..gt_indexes.count loop      dbms_output.put_line(gt_indexes(i) ||':'||gt_tab2(gt_indexes(i)).num);    end loop;    null;  end if;  return g_num_calls;end;/

    Best regards,

    Stew Ashton

  • mathguy
    mathguy Member Posts: 10,229 Blue Diamond
    edited Jan 15, 2019 11:06AM

    Thank you for posting this. I haven't read the code yet, but I started playing with the function. One change I made immediately was to make the G_NUM_CALLS cutoff into an input to the function, so I can see what happens when a solution is not found very quickly.

    It is easy to construct examples with relatively small samples, where the algorithm will have to work very hard (to find a subset sum - or to find conclusively that there is no match). I started with a table of 50 numbers between 1000 and 1000.01, generated randomly (and rounded to six decimal places). Then I take the sum of 8, 10 or 12 etc. of those 50 numbers, by selecting SUM(DISTRIBUTED_NUMBER) over ITEM in a list of values. If the list is such that many of the largest values are included in the sum, the function returns quickly. However, it takes a minute to return the answer when the target is the sum of the ten smallest numbers out of the 50.

    Question for you (perhaps I could answer it by reading the code, too): I tried the sum of the 12 lowest numbers, and I set the cutoff at 500 million recursions. The program abandoned after about two minutes, without the usual list of "Aborted after .... recursions" rows; instead, here is all I got:

    SUBSET_SUM(12000.128539,500000000)----------------------------------                        1693445870:12000.128539Total number of recursions=169344587

    Any thoughts as to why this happened? I tried it twice and I got EXACTLY the same answer. Are there any recursive calls in your function? I know you say "recurse" but is there a function somewhere that really calls itself recursively? Could there be an issue with the depth of a call stack somewhere?

    For what it's worth, I don't think any implementation of a brute force approach, no matter how much it is optimized, will be efficient enough in general. We may be able to find quickly those targets that are the sum of a small number of summands, but not much more. In a sample like mine, where 50 summands are all bunched between 1000 and 1000.01, I don't see any way that a NEGATIVE answer (the target is not a subset sum for the given summands) can ever be obtained quickly.

    Also regarding the optimizations: if you are interested in the Horowitz-Sahni algorithm, the way I implemented it, it is quite rudimentary but I did use all those optimizations, obviously. It may not be obvious where or how; it's in the member function ADD_ELEMENT, where I cut off any sum that is greater than the target. This includes as a trivial special case the summands that are themselves greater than the target.

    Cheers,    -    mathguy

  • Stew Ashton
    Stew Ashton Member Posts: 2,864 Bronze Crown
    edited Jan 15, 2019 11:59AM
    mathguy wrote:...Question for you (perhaps I could answer it by reading the code, too): I tried the sum of the 12 lowest numbers, and I set the cutoff at 500 million recursions. The program abandoned after about two minutes, without the usual list of "Aborted after .... recursions" rows; instead, here is all I got:SUBSET_SUM(12000.128539,500000000)---------------------------------- 1693445870:12000.128539Total number of recursions=169344587Any thoughts as to why this happened? I tried it twice and I got EXACTLY the same answer. Are there any recursive calls in your function? I know you say "recurse" but is there a function somewhere that really calls itself recursively? Could there be an issue with the depth of a call stack somewhere?

    I really have no clue.

    There is a procedure called recurse() starting in line 26 of the code I posted.

    Regards,

    Stew

  • mathguy
    mathguy Member Posts: 10,229 Blue Diamond
    edited Jan 15, 2019 12:03PM

    I tried something else - I called the function with the target 12000 (round number) - I know that this target cannot be achieved, since all the numbers are strictly between 1000 and 1000.01. And I got almost the same answer:

    SUBSET_SUM(12000,500000000)---------------------------                  1693446290:12000Total number of recursions=169344629

    It seems more likely that this is the behavior when the function finds conclusively that the target is NOT a subset sum for the given inputs. It is interesting that the number of iterations is very similar, yet not identical; that must be due to some sums being eliminated sooner with one target vs. the other. (Although I would have expected the number of recursions to be lower for the smaller target; in fact it is the opposite.)

    But if that is so, then the answer I got earlier is incorrect. I called the function specifically with the result of summing 12 different rows from TAB2; so if the program finds conclusively that the target is NOT a subset sum for the given inputs, something is wrong.

    I will take a closer look at the code, to see if there is any reason it may give out false negatives.

    Best,  -  mathguy