Implement set union of many small arrays

Maybe use the C++ STL? (not sure how performant it is, but it took me 30 min to write)

! unique_example.f90

program unique_example
implicit none
integer :: a(4), b(5)
integer, allocatable :: union(:)

a = [1,2,3,3]
b = [2,2,1,1,3]

!
! Modify in place
!
union = [a, b]
call make_unique(union)
print *, union

a = [5,5,5,5]
b = [3,3,3,3,1]

!
! Functional version (use an expression as input)
!
union = unique([a,b])
print *, union

contains

   subroutine make_unique(array)
      integer, intent(inout), allocatable :: array(:)
      interface
         ! void c_make_unique(int array[], int length, int *nunique);
         subroutine c_make_unique(array,length,nunique) bind(c,name="c_make_unique")
            use, intrinsic :: iso_c_binding, only: c_int
            integer(c_int), intent(inout) :: array(*)
            integer(c_int), intent(in), value :: length
            integer(c_int), intent(out) :: nunique
         end subroutine
      end interface
      integer :: n

      call c_make_unique(array,size(array),n)
      ! reallocation on assignment
      array = array(1:n)
   end subroutine

   function unique(array) result(ua)
      integer, intent(in) :: array(:)
      integer, allocatable :: ua(:)
      ua = array
      call make_unique(ua)
   end function

end program
/* c_make_unique.cxx */

#include <algorithm>
#include <span> // C++20

extern "C" {

void c_make_unique(int array[], int length, int *nunique){
  std::span a{array,static_cast<size_t>(length)};
  std::sort(a.begin(),a.end());
  auto last = std::unique(a.begin(),a.end());
  *nunique = last - a.begin();
}

} // extern "C"

Resulting output:

$ make
g++ -c -std=c++20 -Wall -o c_make_unique.o c_make_unique.cxx
gfortran -Wall -Wno-uninitialized -o unique_example unique_example.f90 c_make_unique.o
$ ./unique_example 
           1           2           3
           1           3           5

Oh wow, this is black magic to me, but I find fascinating that interoperability is so easy (once one knows what to do). Outstanding…

Well I solved for now with a humble implementation of a dynamic array (more or less inspired to std::vector, but far from being so smart), in which I first dump all values of the A input array, and then insert (one by one…) the values of input B, that are not found in A. Being A and B not sorted (and not sortable…) I suspect there’s little improvement to get on the checks. But I wonder if a huge preallocation, followed by a truncation (using pack) at the end would result in better performance.

Some details:

  1. My own poor std::vector.push_back:
   type xy_lattice
      !! Wrapper type for storing dynamically
      !! sized collections of lattice sites
      type(xy_site),allocatable  :: site(:)
   contains
      procedure :: push_back
   end type

   type, extends(xy) :: xy_site
      !! A 2D point extension for inequivalent sites
      !!     x,y   :: real-space coordinates
      !!     label :: A or B (sublattice)
      !!     key   :: for lattice lookup
      character(1) :: label
      integer      :: key=0
   end type

   type xy
      !! Base type for 2D points
      real(8) :: x
      real(8) :: y
   contains
      generic :: operator(==) => eq_xy
      generic :: operator(/=) => neq_xy
      procedure, private :: eq_xy
      procedure, private :: neq_xy
   end type

contains

  pure subroutine push_back(vec,val)
      !! Poor man implementation of a dynamic
      !! array, a là std::vector (but without
      !! preallocation and smart doubling...)
      class(xy_lattice),intent(inout)  :: vec
      type(xy_site),intent(in)         :: val
      type(xy_site),allocatable        :: tmp(:)
      integer                          :: len
      if (allocated(vec%site)) then
         len = size(vec%site)
         allocate(tmp(len+1))
         tmp(:len) = vec%site
         call move_alloc(tmp,vec%site)
      else
         len = 1
         allocate(vec%site(len))
      end if
      !PUSH val at the BACK
      vec%site(len+1) = val
  end subroutine push_back
  1. My union building procedure:
   pure function xy_ordered_union(A,B) result(C)
      type(xy_lattice),intent(in)   :: A
      type(xy_lattice),intent(in)   :: B
      type(xy_lattice)              :: C
      integer                       :: i,j
      C = A ! Copy all data of A into C
      do i = 1,size(B%site)
         if(all(B%site(i) /= A%site))then
            call C%push_back(B%site(i))
            ! Don't worry about this...
            j = size(C%site)
            C % site(j) % key = j 
         endif
      enddo
   end function
  1. The alternative with preallocation + pack:
   pure function xy_ordered_union(A,B) result(D)
      type(xy_lattice),intent(in)   :: A
      type(xy_lattice),intent(in)   :: B
      type(xy_lattice)              :: C,D
      integer                       :: i,j,NA,NB
      NA = size(A%site)
      NB = size(B%site)
      allocate(C%site(NA+NB))
      C%site = 0d0
      C%site(1:NA) = A ! copy all data of A into C
      do i = 1,size(B%site)
         if(all(B%site(i) /= A%site))then
            C%site(NA+i) = B%site(i)
            ! Don't worry about this...
            j = size(C%site)
            C % site(j) % key = j
         endif
      enddo
      ! Trim before returning to caller
      D%site = pack(C%site,C%site/=0) 
   end function

NA and NB are always equal to 6 [misleading, see later], but there are in general a lot of them, and this union should be performed eventually on all of them. In the average case the number of unique elements would be 2, s in almost 2/3 of cases I would not push_back in the former implementation. There are instead “boundary” choices of A or B for which almost all elements would be unique and inserted, but in the higher-load case the system has a lot of sites and so the boundary becomes negligible. Since the overall huge union of all small 6-element arrays is partitioned in so much small unions I am not sure preallocation would actually reduce that much the reallocations. Essentially in most cases I would

  • Call move_alloc for ≈ 1/3 of overall elements, but have the call overhead and the full copy of the NA elements every 12 ones. Or…

  • Preallocate everytime to 12 elements, and at the end, on average, trim to 4. All this with in mind let’s say hundreds of elements.

What do you think?

Ops, I failed to realize at first that the ratio of uniques converges fast to 1/3, rather then decreasing indefinitely, what a fail!

Anyway, maybe we should move these last three post two post of us (me and @ivanpribec) to a new thread… Who to ask?

If you do that in a naive way, then it would require NA*NB effort. Unless NA and NB are always small, that is too much.

Here is a suggestion. If you store each set in a binary search tree, then the merge would require only NB*log(NA) or NA*log(NB) effort. You get to choose the smaller of the two.

Yes, exactly. The problem is that NA and NB are always just 6 [MISLEADING, see reply…]. But there are a lot of them (A and B, i.e. I have to call repeatedly the union function. Probably what I really want to do is to concatenate all these small arrays in a huge < data-structure-to-be-defined > and then transform it in a set.

Does not a binary tree require the relevant data (the fields that need to be purged of duplicates) to be sortable? In my case they are in-plane [x,y] coordinates, so that’s not really the case. I have an integer key in the derived type too, but that’s not what I want to sort/search/merge here: that field is already sorted and unique, but different keys can correspond to same [x,y] coordinates (and that’s why I need to do the union and then re-evaluate the keys, so to end up again with them all sorted from 1 to size(union)).

Sorry, I’m evidently tired. The first time NA=NB=6, the second time NB=6 but NA∈[6:12], the third NB=6, NA∈[6:18]… but the actual average lower value would increase until NB=6, NA≈100. This is bad, you are right.

Isn’t the optimal solution here just a hash table?

1 Like

If you have data represented as bits, whether it is (x,y) data or anything else, then you can always generate a hash key for each item and use that key to sort/order the data. A binary search tree is one way to use that hash key to store compactly the data. I mentioned it previously because I have recently used that data structure in some of my work. There are other ways to store the data too, including hash tables.

1 Like

This is the strategy I had envisioned above with the “functional” version:

union = unique([a,b,c,d,...])

The concatenated list exists only for the duration of the call. However, I see your example is more complicated, (i.e. your set doesn’t contain just integers, but real coordinates).

I don’t have any intuition or knowledge of the size of your collections; my approach would be to try both and measure what works best.

1 Like

Probably yes, I also thought about that (and the fact that I store integer keys in the types is a related workaround), but I didn’t know it was possible to generate a hash for two real numbers.

So now I know I can look into that, thanks!

1 Like

No, well. I actually generate these real numbers starting with integer coordinates. I mean, each of these 6-long arrays is generates starting with a triplet of integers. Originally I wanted to hash these triplets, but then dropped the idea, because evidently I don’t want the same hash for all 6 points.

But then I wonder, can I just create the hash by combining the 3 “generators” integers and a [1:6] index for the points? I believe the answer is yes, but I don’t know if it would solve my problem:

  1. I can sort both A and B according to these hashes, but…
  2. An element of A and a element of B with same x,y coordinates would in general have a different hash, so can I really use the hashes to perform the union?

Maybe I need to stick to real-number-generated hashes, but my short sail across google made me suspect is quite a hassle to ensure I don’t have collisions (in many cases there is the danger of having hash(x,y)==hash(y,x) or similar pitfalls).

Either of these is trivially solved by using pretty much any other language which will provide built in ways of hashing tuples and or real numbers. If you want to write this yourself, so long as you use a good Integer hash function, hash(a,b,c) = xor(hash(a), 3*hash(b), 5*hash(c)) will work. Hashing floating point numbers is also pretty trivial. You can just hash the bits as if they were an integer, and everything works.

But what about when I compare those real numbers for the union?

Currently I just set A and B equal if(abs(A%x - B%x) < tol .and. abs(A%y - B%y)) < tol , of course, but then how this translates to bit-generated hashes? The bits of “equal” floats would not be the same in general…

Anyway yes, this is exactly what I normally do when dealing with this kind of stuff in matlab (I assume here a,b,c,d,… are arrays that are being concatenated). But it requires that the derived type contained in the array overloads both the equality (easy) and the < and > operators, since it internally calls sort.

Probably this

is almost true, in the sense that as far as I know neither matlab provides easy ways to generate hashes for a flexible set of types: the containers.Map type accepts only integers or strings as keys, which is more or less what also fhash does, though it provides facilities to define a custom hasher for your derived type (but still you have to implement it!).