Commit Graph

  • 245264c2c9 fix respider frequency bug. Matt Wells 2013-10-21 15:06:23 -0700
  • 64a1c7c2f2 more bug fixes. if spiders disabled for row in url filters, don't spider the url. Matt Wells 2013-10-21 14:45:12 -0700
  • 978910ca7a fix more bugs. Matt Wells 2013-10-21 14:17:32 -0700
  • 1fb85db307 url filters fixes. Matt Wells 2013-10-21 13:44:30 -0700
  • dc4afad67e do not respider if collectiverespiderfreq is <= 0.0. added a url filter for that. added a couple url filters for retrying errors (tcp timed out, etc) Matt Wells 2013-10-21 12:04:08 -0700
  • 605289e130 fix a couple collection related bugs causing cores in crawlbot. Matt Wells 2013-10-21 11:38:33 -0700
  • d2d4379d5c remove debug point. Matt Wells 2013-10-20 10:25:26 -0700
  • 54915dc384 fix data corruption in RdbMem buffer when running with threads disabled. Matt Wells 2013-10-19 19:37:29 -0700
  • 85bca4f3d1 can now delete collection while spiders are out Matt Wells 2013-10-18 18:11:14 -0700
  • 889583ec4b now we can reset collection mid stream Matt Wells 2013-10-18 17:49:36 -0700
  • ecab57ff0f change collnum of reset collection so any adds in progress will fail. Matt Wells 2013-10-18 15:46:00 -0700
  • b589b17e63 fix collection resetting. Matt Wells 2013-10-18 15:21:00 -0700
  • 50313a815f use seeds and spots now Matt Wells 2013-10-18 11:53:14 -0700
  • a288217e9f a few bug fixes Matt Wells 2013-10-17 18:59:00 -0700
  • 84a3aded94 spider round updates correction Matt Wells 2013-10-17 17:18:05 -0700
  • df7fd21253 spider rounds update. Matt Wells 2013-10-17 17:17:19 -0700
  • fe8ebd23a3 added simplified redirect urls to spiderdb as a new spiderrequest. made XmlDoc::getLinks() call m_links.set(redirUrl.getUrl()) so that it is treated like an outlink on the page and gets added from addOutlinkSpiderRecsToMetaList(). Matt Wells 2013-10-17 12:06:12 -0700
  • 92413001fb dirty word detector revisions. we need a word-based function, isDirtyWord() which IDs single words, bigrams and trigrams. it'll be much faster than the current approach and won't slow down when the list of dirty words gets big. we then need isDirtyUrl() to use the logic in Speller.cpp to split a url up into composing words to run through isDirtyWord(). mwells 2013-10-16 20:19:49 -0700
  • b9f94d7d45 show cached json objects as application/json without term highlighting and the disclaimer Matt Wells 2013-10-16 17:54:17 -0700
  • d9b132fd5a make : into . for indexing json names. Matt Wells 2013-10-16 17:43:46 -0700
  • 74c2742ced fix mem leak of LinkInfo. fixed json output from injecting url. Matt Wells 2013-10-16 17:17:28 -0700
  • 70c4ef682d printing updates Matt Wells 2013-10-16 16:27:24 -0700
  • ee06428059 fix json indexing and searching Matt Wells 2013-10-16 16:15:28 -0700
  • 9d6c3626d8 json indexing/hashing updates. Matt Wells 2013-10-16 15:41:12 -0700
  • bb09b4f742 do not store diffbot api url in diffbot reply yet. later may want to store in each diffbot object doc maybe as part of the json content? Matt Wells 2013-10-16 15:24:22 -0700
  • 11897f09da turn off log debug msg. mwells 2013-10-16 16:24:08 -0600
  • f8256c3ef9 fix core from diffbot object doc not having valid dmoz info Matt Wells 2013-10-16 15:14:39 -0700
  • acad6f48d3 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-10-16 14:54:02 -0700
  • dae005e4ae ensure dmoz info valid when making titlerec Matt Wells 2013-10-16 14:53:48 -0700
  • 57ee9739e5 fix addColl() logic for collectionless rdbs Matt Wells 2013-10-16 14:38:09 -0700
  • fc17521697 Merge branch 'master' into diffbot Matt Wells 2013-10-16 14:28:42 -0700
  • 22ef91a6f1 show all colls in json after deleteCrawl operation Matt Wells 2013-10-16 14:13:28 -0700
  • e565a861ae give nice reply from seed in json. show how many outlinks from same domain were found and how many outlinks were filtered. same for addurls (bulk add). Matt Wells 2013-10-16 14:03:14 -0700
  • 36e6e21ae3 url filter processing looking good. Matt Wells 2013-10-16 12:19:25 -0700
  • f5e5b0f5d3 fix crawlbot bugs Matt Wells 2013-10-16 12:12:22 -0700
  • d2f39cd1a0 comment updates mwells 2013-10-15 23:13:50 -0700
  • 6052f60c48 speed up dirty word detection since we added a bunch of new dirty words/phrases. mwells 2013-10-15 22:41:31 -0700
  • 70e4a57449 add respider freq parm to crawlbot page Matt Wells 2013-10-15 16:57:34 -0700
  • f345c35927 crawlbot fixes. Matt Wells 2013-10-15 16:31:59 -0700
  • f892b828d9 crawlbot api fixes Matt Wells 2013-10-15 14:08:55 -0700
  • 00029129fd crawlbot api fixes Matt Wells 2013-10-15 12:40:56 -0700
  • 65d6af7791 show expression/action pairs in crawlbot json output. Matt Wells 2013-10-15 11:50:57 -0700
  • ed4eff784b json output crawlbot fixes. mwells 2013-10-15 12:45:23 -0600
  • 70eaa542f3 minor change Matt Wells 2013-10-15 11:17:44 -0700
  • a3a9a43ded api nomenclature mwells 2013-10-15 12:31:02 -0600
  • 313eb1209e more crawlbot fixes mwells 2013-10-15 12:22:59 -0600
  • d8835acfef crawlbot api work. mwells 2013-10-15 11:54:54 -0600
  • 37a9e82060 update the dirty word list. but we still should remove tags, except maybe outlinks, and detect the dirty words on what remains. getting too many false positives in tags still. mwells 2013-10-15 01:01:19 -0700
  • 3db726c22e take out references to AdultBit.cpp, since it is no longer used. mwells 2013-10-14 23:21:58 -0700
  • 12bff1e9b0 fix potential problem of tons of points in our statsdb div graph. use hashtable to dedup points and save from printing out too many <div> tags. mwells 2013-10-14 22:52:29 -0700
  • 90fca8c171 fix "search in category" link. mwells 2013-10-14 22:39:42 -0700
  • 0096877127 fix "statsdb" graph so it seems to work now. mwells 2013-10-14 22:31:00 -0700
  • 9e9ef9c2cc still getting statsdb link to work. a little better now. mwells 2013-10-14 21:21:27 -0700
  • 8f93a72961 start using html div graph for PageStatsdb.cpp now too. mwells 2013-10-14 20:35:45 -0700
  • a0808df2ae got new diffbot api compiled mwells 2013-10-14 18:19:59 -0600
  • c19310cb7e code checkpoint mwells 2013-10-14 17:19:30 -0600
  • a562c65627 another code checkpoint. new json api for crawlbot. new url filters for crawlbot. mwells 2013-10-14 16:10:48 -0600
  • 5a7d70f7b2 code checkpoint mwells 2013-10-14 13:00:05 -0600
  • 80918ca6e3 remove old libplotter references and files. mwells 2013-10-13 23:48:07 -0700
  • 553c28fbe0 get performance graphing working again. use absolute divs to draw the graph instead of old gif plotter library. mwells 2013-10-13 23:39:31 -0700
  • 81a09f9835 half way done fixing performance graph. needs more work. mwells 2013-10-13 22:02:21 -0700
  • c7cf6a817a dmoz directory root page search box should just search all sites in dmoz. mwells 2013-10-13 20:13:15 -0700
  • 3ac5838b8f fix the search tabs for the dmoz directory search box. allow more error types when spidering dmoz docs. mwells 2013-10-13 18:43:45 -0700
  • 66364c581a minor fix when indexing dmoz urls. mwells 2013-10-13 17:12:20 -0700
  • 876af6d8c6 dmoz support is now updated and re-integrated. mwells 2013-10-13 16:53:28 -0700
  • 3bc85cf528 a few cleanups for the new dmoz code. mwells 2013-10-13 16:48:59 -0700
  • 0cc78dc2e0 fix dup bug. mwells 2013-10-13 16:06:38 -0700
  • d41d5554da fix dmoz search. mwells 2013-10-13 16:00:44 -0700
  • 4cbb31e180 added searchbox for dmoz pages/sites. mwells 2013-10-13 15:45:12 -0700
  • b60bdcc038 documentation updates. fixed sd=0. mwells 2013-10-13 14:24:41 -0700
  • 2c7bc9031f documentation updates. mwells 2013-10-13 13:15:31 -0700
  • 8547b8f802 print pretty dmoz pages. mwells 2013-10-13 00:39:05 -0700
  • d4b5c37f45 Merge branch 'master' into testing mwells 2013-10-13 00:20:37 -0700
  • fbcaefa6ff so we can spider https sites, add the old gigablast private/public key file. mwells 2013-10-13 00:15:39 -0700
  • 65bad44450 try to fix EBADIP stopping a page from getting indexed into dmoz. mwells 2013-10-13 00:14:16 -0700
  • c949bfe315 ignore certain errors and index the doc anyway so we at least have it in our dmoz index with its designated title and summary from dmoz. mwells 2013-10-13 00:02:25 -0700
  • eeb10bb99a fix ip vector logic in xmldoc.cpp. mwells 2013-10-12 23:14:39 -0700
  • c283e85e40 add support for noindex meta tag. use it in the gbdmoz.urls.txt.* files that contain the dmoz urls we want to spider. mwells 2013-10-12 22:50:23 -0700
  • d300dc42f7 added XmlDoc::getDmozTitles() and related functions. mwells 2013-10-12 21:56:25 -0700
  • 3374ce450a fix a couple catdb generation bugs. MAX_CATIDS violation causing corruption. not saving catdb tree to catdb-saved.dat causing missing catdb recs. mwells 2013-10-12 20:33:04 -0700
  • 0de777d80d parser fixes mwells 2013-10-11 17:35:12 -0600
  • 6d5643e185 json parsing mwells 2013-10-11 16:14:26 -0600
  • 1d133e87c9 just print dmoz pages verbatim for now. later we can show the dmoz entries as search results. mwells 2013-10-10 23:18:57 -0700
  • 547420396f added Categories::printUrlsInTopic() to print the dmoz urls for a catid. can replace us doing a search for the time being. mwells 2013-10-10 23:03:07 -0700
  • 38bb82c902 make /Top the top dir page now. mwells 2013-10-10 22:46:31 -0700
  • 55c5ad2921 fix "Top/" issues in breadcrumb etc. mwells 2013-10-10 22:27:49 -0700
  • ca6af65217 got dmoz navigation system working. now we just need to index the urls to populate dmoz. mwells 2013-10-10 22:08:21 -0700
  • be01041e36 added support for new url filter: "lastspidertime>={roundstart} --> IGNORE" so we can spider all urls before we advance to the next spider round and re-spider everything again. CollectionRec::m_spiderRoundStartTime and CollectionRec::m_spiderRoundNum are the new collection rec parms. show the round stuff on url filters page. mwells 2013-10-10 18:47:46 -0600
  • 08c153bdb0 add old gb.pem file, not used by gigablast any more, but allows https server to startup. mwells 2013-10-09 17:37:01 -0600
  • ea859ef685 added 'gb emailmandrill' for testing. got it working. it posts json, not url encoded. mwells 2013-10-09 17:35:51 -0600
  • 2bb8b818d6 more bug fixes with notification system. mwells 2013-10-09 16:28:15 -0600
  • c1c5c4e3d0 send notifications if no urls available for immediate spidering. mwells 2013-10-09 15:24:35 -0600
  • 24e3b8cf52 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-09 13:07:22 -0600
  • 0b4bbf926e fix potential compiler error. Matt Wells 2013-10-09 11:52:58 -0700
  • 58e2be8b6f take out log msg Matt Wells 2013-10-09 11:51:39 -0700
  • 283ec2f6b4 email and webhook alerts when spider runs out of urls to spider. Matt Wells 2013-10-09 11:42:56 -0700
  • 7ba9994804 many dmoz fixes. but still more we need to do. isn't printing subcategories right now. mwells 2013-10-08 23:55:11 -0700
  • 3702a05d64 add sendEmailThroughMandrill() to send through mail chimp http api. Matt Wells 2013-10-08 18:01:38 -0700
  • 9eecfd378c added support for pageprocesspattern again. || separated strings to find in m_content before sending to diffbot. Matt Wells 2013-10-08 17:08:58 -0700
  • a78dc35169 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-08 17:51:07 -0600