Cosine similarity search¶

Search for similar vecs¶

Create a random matrix simulating 500 items, each with an embedding of hidden size 128.

import numpy as np

base = np.random.rand(500, 128)

cosine = CosineSearch(base)
cosine

[Consine Similarity Search] (500 items)

Rank the distance to the 6th item¶

cosine(base[5])

array([  5, 268, 327, 309, 365,  34, 388, 173, 415, 135, 151, 461, 307,
       275, 469, 384, 416,  60, 293, 236, 153, 493, 464, 402,  74, 383,
        15, 294,  95, 485, 103, 488, 156, 122, 283, 379, 321, 477, 300,
       348, 100, 381, 317, 209, 231, 182, 174, 457, 332, 314, 256, 326,
       251, 313,   9, 183, 270, 133,  70, 424, 227, 399, 234, 205, 487,
        84,  31, 232, 322, 428, 311,  67, 380,  19, 471, 255, 419, 224,
       413, 247, 328, 436, 367,  64, 385, 279, 344, 406, 306, 238, 357,
       335, 248, 249, 312, 169, 221, 124, 297, 427,  52, 346, 136, 288,
       120,  93, 250, 495,  22, 143, 273, 206, 149, 305, 159, 438, 218,
       343, 195,  24, 142,  50, 150, 199, 434, 465, 123,  69, 223,  79,
       291, 154,  73,  10, 222,  18,  76,  68, 213, 139, 489, 323, 286,
       241, 158, 106,  92,  37, 301, 408, 141, 272,   2, 207, 179,  32,
       395, 390, 290, 366, 121, 138, 181, 292, 377,  20, 134, 282,   7,
       497, 108, 244, 210, 146, 360, 304, 404, 467, 296, 498, 472, 375,
        72, 338,  90,  43, 370,  16, 329, 362, 189, 180, 117, 168, 349,
        49, 483, 212, 391,  45, 444,  21,  17, 177, 118, 474, 201,  96,
       147, 431,  55, 478, 337, 192, 341, 363, 188,  71,  40, 127, 441,
       219, 421, 350, 400, 184, 240, 254, 451, 482,  44, 353,  33, 265,
       356, 437, 303, 442,  13, 228, 233,  81, 160, 462, 320, 420, 445,
        59, 155, 115, 264, 340, 128,  42, 387, 448,  89,  91, 246, 298,
       325, 429, 277, 476, 426, 475, 494, 425, 450, 358,  38, 262, 418,
       315,  46, 239,  87, 280, 230,  12, 152, 161,  28, 203, 319, 217,
        83, 310,  63, 318, 204, 432, 491, 113, 492,   3, 378, 440, 499,
       260, 334,  29, 237, 411, 331, 401, 208, 259, 473, 129,  85, 392,
       470, 110, 111, 243, 144,  97, 336, 253, 107, 109, 194, 145, 397,
       140, 114, 352,   6, 178, 284, 372, 463,  56, 148, 345, 137, 409,
       175,  61, 220, 299, 126, 164, 455,  36, 382, 459, 252, 215, 157,
       196, 333,   4, 449, 458,  58, 480, 263, 373,  25, 165, 452, 481,
       186, 446,  53,  98, 376, 171, 430, 368,  30, 324,  78, 351, 405,
       167, 214, 101,  94, 295, 130, 389, 460, 393, 308, 285, 229,  47,
       490, 271,  66, 191,  35, 197, 266, 102, 162, 386,   0, 407, 403,
       226, 435, 278, 447, 235, 347, 105,   8,  11,  41, 486, 245, 359,
       242, 225, 267, 374,  39, 371,  86, 433, 342, 456,  88, 276, 422,
       361, 354, 116, 369,  14,  75, 364, 484, 330, 257, 439, 316, 185,
       198, 200, 414, 163, 202, 261, 412, 187, 132, 281, 170, 172, 453,
        99, 496, 258, 302, 216, 468, 193, 410,  57, 274, 112,  77, 355,
       125,  23,   1, 339, 287,  54, 396, 190, 394, 454,  82, 289, 211,
        80, 417, 466,  26, 269, 443, 398, 479, 423,  48, 176,  51, 119,
        65, 131,  62, 166, 104,  27])
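The source does not show how `CosineSearch` computes this ranking. As a rough sketch, assuming it sorts rows by cosine similarity to the query in descending order (consistent with the query's own index, 5, coming first above), the same kind of ranking could be produced in plain NumPy. The function name `rank_by_cosine` is hypothetical and not part of the library.

```python
import numpy as np


def rank_by_cosine(base, query):
    """Return row indices of `base`, most similar to `query` first.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # Scale every row and the query to unit length, so that a dot
    # product between them equals their cosine similarity.
    base_unit = base / np.linalg.norm(base, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarity = base_unit @ query_unit
    # argsort sorts ascending, so negate to get descending similarity.
    return np.argsort(-similarity)


rng = np.random.default_rng(0)
base = rng.random((500, 128))

order = rank_by_cosine(base, base[5])
print(order[:5])
```

Because the data is random, the exact ordering differs from the array above; only the query's own index is guaranteed to come first.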

Rank the distance to the 10th item¶

Return the similarity values as well.

import pandas as pd

order, similarity = cosine.search(base[9], return_similarity=True)

pd.DataFrame({"order": order, "similarity": similarity}).head(10)
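To make the two return values concrete, here is a minimal NumPy sketch. It assumes that `return_similarity=True` yields the ranked indices together with their scores in the same order. `search_with_similarity` is a hypothetical helper, and its scores are textbook cosine similarities, which may be scaled differently from the values the library reports.

```python
import numpy as np


def search_with_similarity(base, query):
    """Return (order, similarity), both sorted from most to least similar.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # Cosine similarity: dot product divided by the product of the norms.
    norms = np.linalg.norm(base, axis=1) * np.linalg.norm(query)
    similarity = (base @ query) / norms
    # Reverse the ascending argsort to put the best match first.
    order = np.argsort(similarity)[::-1]
    return order, similarity[order]


rng = np.random.default_rng(0)
base = rng.random((500, 128))

order, similarity = search_with_similarity(base, base[9])
for idx, score in zip(order[:3], similarity[:3]):
    print(int(idx), round(float(score), 4))
```

Returning the scores already reordered keeps `order[i]` and `similarity[i]` aligned, so they can go straight into the two DataFrame columns.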

Embedding interpretation¶

Interpret a PyTorch embedding matrix using TensorBoard in Colab.

Usage for other tasks¶

e.g. recommender system¶

Suppose we have 500 movies and have already learnt a latent vector for each of them.

Given one movie, can we find the most similar ones?

This lets us build features such as "You watched <...>, you may also like..."

NUM_ITEMS = 500

# an embedding matrix of shape (NUM_ITEMS, 42)
movie_embedding = np.random.rand(NUM_ITEMS, 42)

# a dictionary mapping each index to its string label
vocab = {i: f"movie #{i}" for i in range(NUM_ITEMS)}

interp = InterpEmbeddings(movie_embedding, vocab=vocab)

interp.search("movie #22")

	tokens	idx	similarity
0	movie #22	22	0.262194
1	movie #295	295	0.227033
2	movie #108	108	0.225516
3	movie #0	0	0.223599
4	movie #144	144	0.222527
5	movie #232	232	0.222521
6	movie #498	498	0.222113
7	movie #381	381	0.220832
8	movie #220	220	0.219798
9	movie #305	305	0.218822
10	movie #236	236	0.218546
11	movie #14	14	0.218388
12	movie #274	274	0.217985
13	movie #127	127	0.217889
14	movie #326	326	0.217152
15	movie #363	363	0.217147
16	movie #61	61	0.216979
17	movie #352	352	0.216956
18	movie #345	345	0.216823
19	movie #147	147	0.216784
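The source does not show how `InterpEmbeddings` resolves a token into a ranked list. A minimal sketch, assuming `vocab` maps index to string as above, inverts that mapping to find the query row and then ranks all rows by cosine similarity. The names `token_to_idx` and `most_similar` are hypothetical, and the scores are textbook cosine similarities rather than the values in the table above.

```python
import numpy as np

NUM_ITEMS = 500

rng = np.random.default_rng(0)
movie_embedding = rng.random((NUM_ITEMS, 42))

# index -> label, as in the example above, plus the reverse lookup
vocab = {i: f"movie #{i}" for i in range(NUM_ITEMS)}
token_to_idx = {token: i for i, token in vocab.items()}


def most_similar(token, top_k=5):
    """Return the `top_k` (label, similarity) pairs closest to `token`.

    Hypothetical helper for illustration; not the library's implementation.
    """
    query = movie_embedding[token_to_idx[token]]
    # Unit-normalize rows and query so the dot product is cosine similarity.
    unit = movie_embedding / np.linalg.norm(movie_embedding, axis=1, keepdims=True)
    similarity = unit @ (query / np.linalg.norm(query))
    # Negate for descending order, then keep the first top_k indices.
    order = np.argsort(-similarity)[:top_k]
    return [(vocab[int(i)], float(similarity[i])) for i in order]


recommendations = most_similar("movie #22")
for title, score in recommendations:
    print(f"{title}\t{score:.4f}")
```

The queried movie always ranks first, so a "you may also like" feature would typically skip the first entry.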

Search with token id 22

	order	similarity
0	9	0.146922
1	28	0.122393
2	436	0.122063
3	236	0.121338
4	135	0.120816
5	225	0.120507
6	113	0.120241
7	160	0.120201
8	446	0.119380
9	58	0.119282
