Post by account_disabled on Feb 27, 2024 10:01:29 GMT
The centroids. At first we tested on tags and obtained reasonable results. Once k-means completed, we pulled all of the centroids, obtained each centroid's nearest relative from the custom Word2Vec model, and then assigned the tags to their centroid category in the main dataset.

Tag Tokens          | Tag Pos                   | Tag Lemm.          | Categorization
--------------------|---------------------------|--------------------|---------------
beach photographs   | beach NN photographs NN   | beach photograph   | beach photo
seaside photographs | seaside NN photographs NN | seaside photograph | beach photo
coastal photographs | coastal JJ photographs NN | coastal photograph | beach photo
seaside photographs | seaside NN photographs NN | seaside photograph | beach photo
seaside posters     | seaside NN posters NNS    | seaside poster     | beach photo
coast photographs   | coast NN photographs NN   | coast photograph   | beach photo
beach photos        | beach NN photos NNS       | beach photo        | beach photo

The Categorization column above was the centroid selected by k-means.
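The centroid-to-category step above can be sketched with plain numpy: for each k-means centroid, find the vocabulary term whose embedding is closest by cosine similarity. The vocabulary and 4-d vectors below are toy stand-ins for the real Word2Vec model, not its actual values:

```python
import numpy as np

# Toy stand-in for the custom Word2Vec vocabulary (hypothetical terms/vectors).
vocab = ["beach photo", "clothing", "wall art"]
vectors = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.0, 0.8, 0.3, 0.1],
    [0.1, 0.2, 0.9, 0.0],
])

def nearest_term(centroid, vocab, vectors):
    """Return the vocabulary term whose vector has the highest
    cosine similarity to the given centroid."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    sims = vectors @ centroid / norms
    return vocab[int(np.argmax(sims))]

# A k-means centroid that landed in the "beach photo" region of the space.
centroid = np.array([0.85, 0.15, 0.05, 0.18])
print(nearest_term(centroid, vocab, vectors))  # → beach photo
```

With a real gensim model the same lookup can be done against the trained vectors rather than a toy matrix.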
Notice how it handled the matching of seaside to beach and coastal to beach.

Benefits: This method seemed to do a good job of finding associations between the tags and their categories that were more semantic than character-driven. For example, "blue shirt" might be matched to "clothing," which would obviously not be possible without the semantic relationships found within the vector space.

Limitations: Ultimately, the chief limitation we encountered was trying to run k-means on the full two million tags while ending up with a workable number of category centroids.
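The "semantic rather than character-driven" point can be made concrete: by edit distance, "seaside" and "beach" look entirely unrelated, yet embedding vectors can place them close together because they occur in similar contexts. The three vectors below are hypothetical illustrations, not values from the trained model:

```python
import numpy as np

def edit_distance(a, b):
    """Classic Levenshtein distance: a purely character-driven measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# "seaside" and "beach" share almost no characters...
print(edit_distance("seaside", "beach"))  # → 5 (out of max 7): nearly unrelated strings

# ...but hypothetical embedding vectors can still place them close together.
seaside = np.array([0.80, 0.20, 0.10])
beach   = np.array([0.75, 0.25, 0.05])
shirt   = np.array([0.10, 0.10, 0.90])
cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(cos(seaside, beach) > cos(seaside, shirt))  # True
```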
Sklearn for Python allows for multiple concurrent jobs, but only across the initializations of the centroids, meaning that even on a multi-core processor the number of concurrent jobs was limited by the number of initializations. We also tried reducing the vector sizes, but the results were overall poor. Additionally, because embeddings are generally built on the probabilistic closeness of terms in the corpus on which they were trained, there were matches where you could understand why they matched but which were obviously not the correct category, e.g. the art of one century being picked as the category for the art of a different century. Finally, context matters, and word embeddings obviously struggle with the difference between senses of the same word, such as "duck" the animal and "duck" the action.
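The vector-size reduction attempted above can be sketched with a plain truncated SVD. The original post does not state the method or dimensions used, so the matrix shape, the target dimension k, and the SVD approach here are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the tag embedding matrix: 1,000 hypothetical tags x 300 dims
# (the real corpus had ~2 million tags; these sizes are illustrative only).
X = rng.standard_normal((1000, 300))

# Truncated SVD keeps only the top-k directions of variance, shrinking each
# vector so every k-means distance computation touches fewer dimensions.
k = 50
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
X_reduced = U[:, :k] * S[:k]

print(X_reduced.shape)  # → (1000, 50)
```

The trade-off the post observed, i.e. poor results after reduction, is consistent with discarding variance that the clustering actually needed.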