Easy cosine similarity search. Searching for similar features among vectors is a frequently encountered task.

class CosineSearch

CosineSearch(base:ndarray)

Build an index for searching by cosine distance:

cos = CosineSearch(base_array)
idx_order = cos(vec)
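To make the idea concrete, here is a minimal NumPy sketch of ranking rows by cosine similarity. It is not the library's implementation: `cosine_rank` is a hypothetical helper, and it assumes every row and the query are non-zero vectors.

```python
import numpy as np


def cosine_rank(base: np.ndarray, vec: np.ndarray) -> np.ndarray:
    """Return the row indices of `base`, most similar to `vec` first."""
    # Scale the rows and the query to unit length, so that a dot product
    # equals the cosine of the angle between them.
    base_unit = base / np.linalg.norm(base, axis=1, keepdims=True)
    vec_unit = vec / np.linalg.norm(vec)
    similarity = base_unit @ vec_unit
    # argsort is ascending, so negate to put the highest similarity first.
    return np.argsort(-similarity)


rng = np.random.default_rng(0)
base = rng.random((500, 128))
idx_order = cosine_rank(base, base[5])
print(idx_order[:5])
```

Querying with a row taken from `base` should put that row's own index first, since a vector has cosine similarity 1 with itself.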

Search for similar vecs

Create a random matrix, simulating 500 items x 128 embedding hidden size

import numpy as np
base = np.random.rand(500,128)
cosine = CosineSearch(base)
cosine
[Consine Similarity Search] (500 items)

Rank all items by similarity to the 6th item (index 5), most similar first

cosine(base[5])
array([  5, 268, 327, 309, 365,  34, 388, 173, 415, 135, 151, 461, 307,
       275, 469, 384, 416,  60, 293, 236, 153, 493, 464, 402,  74, 383,
        15, 294,  95, 485, 103, 488, 156, 122, 283, 379, 321, 477, 300,
       348, 100, 381, 317, 209, 231, 182, 174, 457, 332, 314, 256, 326,
       251, 313,   9, 183, 270, 133,  70, 424, 227, 399, 234, 205, 487,
        84,  31, 232, 322, 428, 311,  67, 380,  19, 471, 255, 419, 224,
       413, 247, 328, 436, 367,  64, 385, 279, 344, 406, 306, 238, 357,
       335, 248, 249, 312, 169, 221, 124, 297, 427,  52, 346, 136, 288,
       120,  93, 250, 495,  22, 143, 273, 206, 149, 305, 159, 438, 218,
       343, 195,  24, 142,  50, 150, 199, 434, 465, 123,  69, 223,  79,
       291, 154,  73,  10, 222,  18,  76,  68, 213, 139, 489, 323, 286,
       241, 158, 106,  92,  37, 301, 408, 141, 272,   2, 207, 179,  32,
       395, 390, 290, 366, 121, 138, 181, 292, 377,  20, 134, 282,   7,
       497, 108, 244, 210, 146, 360, 304, 404, 467, 296, 498, 472, 375,
        72, 338,  90,  43, 370,  16, 329, 362, 189, 180, 117, 168, 349,
        49, 483, 212, 391,  45, 444,  21,  17, 177, 118, 474, 201,  96,
       147, 431,  55, 478, 337, 192, 341, 363, 188,  71,  40, 127, 441,
       219, 421, 350, 400, 184, 240, 254, 451, 482,  44, 353,  33, 265,
       356, 437, 303, 442,  13, 228, 233,  81, 160, 462, 320, 420, 445,
        59, 155, 115, 264, 340, 128,  42, 387, 448,  89,  91, 246, 298,
       325, 429, 277, 476, 426, 475, 494, 425, 450, 358,  38, 262, 418,
       315,  46, 239,  87, 280, 230,  12, 152, 161,  28, 203, 319, 217,
        83, 310,  63, 318, 204, 432, 491, 113, 492,   3, 378, 440, 499,
       260, 334,  29, 237, 411, 331, 401, 208, 259, 473, 129,  85, 392,
       470, 110, 111, 243, 144,  97, 336, 253, 107, 109, 194, 145, 397,
       140, 114, 352,   6, 178, 284, 372, 463,  56, 148, 345, 137, 409,
       175,  61, 220, 299, 126, 164, 455,  36, 382, 459, 252, 215, 157,
       196, 333,   4, 449, 458,  58, 480, 263, 373,  25, 165, 452, 481,
       186, 446,  53,  98, 376, 171, 430, 368,  30, 324,  78, 351, 405,
       167, 214, 101,  94, 295, 130, 389, 460, 393, 308, 285, 229,  47,
       490, 271,  66, 191,  35, 197, 266, 102, 162, 386,   0, 407, 403,
       226, 435, 278, 447, 235, 347, 105,   8,  11,  41, 486, 245, 359,
       242, 225, 267, 374,  39, 371,  86, 433, 342, 456,  88, 276, 422,
       361, 354, 116, 369,  14,  75, 364, 484, 330, 257, 439, 316, 185,
       198, 200, 414, 163, 202, 261, 412, 187, 132, 281, 170, 172, 453,
        99, 496, 258, 302, 216, 468, 193, 410,  57, 274, 112,  77, 355,
       125,  23,   1, 339, 287,  54, 396, 190, 394, 454,  82, 289, 211,
        80, 417, 466,  26, 269, 443, 398, 479, 423,  48, 176,  51, 119,
        65, 131,  62, 166, 104,  27])

Rank all items by similarity to the 10th item (index 9)

Return the similarity values as well

import pandas as pd
order, similarity = cosine.search(base[9], return_similarity=True)

pd.DataFrame({"order": order, "similarity": similarity}).head(10)
   order  similarity
0      9    0.146922
1     28    0.122393
2    436    0.122063
3    236    0.121338
4    135    0.120816
5    225    0.120507
6    113    0.120241
7    160    0.120201
8    446    0.119380
9     58    0.119282
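When only the first few rows are needed, sorting all 500 items is avoidable. The sketch below is one way to return the top `k` indices together with their scores. `top_k_similar` is a hypothetical helper, not part of the library, and it assumes non-zero vectors and `k` no larger than the number of rows. It computes plain cosine similarity, so its scores may be on a different scale from the values in the table above.

```python
import numpy as np


def top_k_similar(base: np.ndarray, vec: np.ndarray, k: int = 10):
    """Return (indices, similarities) of the k rows most similar to `vec`."""
    base_unit = base / np.linalg.norm(base, axis=1, keepdims=True)
    vec_unit = vec / np.linalg.norm(vec)
    similarity = base_unit @ vec_unit
    # argpartition moves the k largest similarities to the front in
    # linear time, but leaves those k entries in arbitrary order.
    candidates = np.argpartition(-similarity, k - 1)[:k]
    # Sort only the k candidates, highest similarity first.
    order = candidates[np.argsort(-similarity[candidates])]
    return order, similarity[order]


rng = np.random.default_rng(0)
base = rng.random((500, 128))
order, sims = top_k_similar(base, base[9], k=10)
for idx, score in zip(order, sims):
    print(idx, round(float(score), 6))
```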

Embedding interpretation

Interpret a PyTorch embedding matrix using TensorBoard in Colab

class InterpEmbeddings

InterpEmbeddings(embedding_matrix:ndarray, vocab:Dict[int, str])

interp = InterpEmbeddings(embedding_matrix, vocab_dict)

interp.search("computer")

Visualize the embedding with TensorBoard

interp.visualize_in_tb()
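If TensorBoard is not available, a similar view can be obtained by exporting the matrix as tab-separated files and loading them into the standalone Embedding Projector. The sketch below is an alternative to `visualize_in_tb`, not a description of what it does. `export_for_projector` and the file names are hypothetical, and `vocab` is assumed to map every row index to a label.

```python
import os
import tempfile

import numpy as np


def export_for_projector(embedding_matrix, vocab, out_dir):
    """Write vectors.tsv and metadata.tsv with one line per embedding row."""
    os.makedirs(out_dir, exist_ok=True)
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    # One row per item, values separated by tabs.
    np.savetxt(vec_path, embedding_matrix, delimiter="\t")
    # Single-column metadata: one label per line, in row order, no header.
    with open(meta_path, "w", encoding="utf-8") as f:
        for i in range(len(embedding_matrix)):
            f.write(f"{vocab[i]}\n")
    return vec_path, meta_path


rng = np.random.default_rng(0)
embedding_matrix = rng.random((20, 8))
vocab_dict = {i: f"token_{i}" for i in range(20)}
vec_path, meta_path = export_for_projector(
    embedding_matrix, vocab_dict, tempfile.mkdtemp()
)
print(vec_path, meta_path)
```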

Usage for other tasks

e.g. a recommender system

Suppose we have 500 movies and have learned latent vectors for them.

Given one movie, can we find the most similar ones?

With that, we can build a feature such as "You watched <...>, you may also like..."

NUM_ITEMS = 500

# an embedding matrix of shape (NUM_ITEMS, 42)
movie_embedding = np.random.rand(NUM_ITEMS,42)

# a dictionary mapping index to string
vocab = {i: f"movie #{i}" for i in range(NUM_ITEMS)}
interp = InterpEmbeddings(movie_embedding, vocab=vocab)
interp.search("movie #22")

Search with token id 22

        tokens  idx  similarity
0    movie #22   22    0.262194
1   movie #295  295    0.227033
2   movie #108  108    0.225516
3     movie #0    0    0.223599
4   movie #144  144    0.222527
5   movie #232  232    0.222521
6   movie #498  498    0.222113
7   movie #381  381    0.220832
8   movie #220  220    0.219798
9   movie #305  305    0.218822
10  movie #236  236    0.218546
11   movie #14   14    0.218388
12  movie #274  274    0.217985
13  movie #127  127    0.217889
14  movie #326  326    0.217152
15  movie #363  363    0.217147
16   movie #61   61    0.216979
17  movie #352  352    0.216956
18  movie #345  345    0.216823
19  movie #147  147    0.216784
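The "you may also like" feature described in this section can be sketched without the library. This is an illustration under stated assumptions rather than how `InterpEmbeddings` works internally: `recommend` is a hypothetical helper, titles are assumed to be unique, and the watched movie is dropped from its own recommendations.

```python
import numpy as np


def recommend(embedding, vocab, watched, top_n=5):
    """Return the `top_n` titles most similar to `watched`, excluding itself."""
    # Invert the index -> title mapping to look up the query row.
    title_to_idx = {title: idx for idx, title in vocab.items()}
    watched_idx = title_to_idx[watched]
    # Unit-normalize rows so dot products are cosine similarities.
    unit = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    similarity = unit @ unit[watched_idx]
    order = np.argsort(-similarity)
    picks = [int(i) for i in order if i != watched_idx][:top_n]
    return [vocab[i] for i in picks]


NUM_ITEMS = 500
rng = np.random.default_rng(0)
movie_embedding = rng.random((NUM_ITEMS, 42))
vocab = {i: f"movie #{i}" for i in range(NUM_ITEMS)}

recs = recommend(movie_embedding, vocab, "movie #22")
print(f"You watched movie #22, you may also like: {', '.join(recs)}")
```

With random vectors the suggestions carry no meaning, but the same lookup applies unchanged to learned embeddings.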