Commit Graph

  • 5022ea4d6e try ditching pthreads and using straight-up errno. it seems perhaps each clone() gets its own copy of errno now? Matt Wells 2013-11-17 19:43:20 -07:00
  • 91279ff475 committing an abandoned asyncio project. mwells 2013-11-17 19:15:38 -07:00
  • dfab4ee13d fixed bugs with advanced.html advanced search page. made stats graph only show last 5 minutes of stats. tends to make the graph look more continuous. do not use ajax to fetch the search results unless this is running in matt wells' datacenter. it is only an anti bot scraping measure and unnecessarily complicates things for others. Matt Wells 2013-11-17 14:58:47 -07:00
  • 64ef37db2f Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-11-17 11:48:29 -07:00
  • 5cf25cfa3f fix graphing bug when graphing performance graph. #background-color:xxxxxx was not always 6 hex digits and would not render right because of that. Matt Wells 2013-11-17 11:48:17 -07:00
  • 12a6783a24 Update README.md Gigablast 2013-11-16 20:14:06 -08:00
  • dc226dde0e fix LinkInfo mem leaks mwells 2013-11-16 17:50:32 -08:00
  • 75e35b0c8d fix pthread_join bug some more. Matt Wells 2013-11-16 18:34:06 -07:00
  • e5da0ac967 do not try to join on thread when pthread_create() fails to create thread. was causing core. Matt Wells 2013-11-16 18:28:49 -07:00
  • e756aabf4b allow redirects to goog and bing again Matt Wells 2013-11-16 12:11:25 -07:00
  • e27646c088 cleanup fixes. Matt Wells 2013-11-15 15:01:56 -07:00
  • f43b7dca13 Merge branch 'master' into testing Matt Wells 2013-11-15 14:47:35 -07:00
  • 5e30728a3a new graphic icons. minor clean ups. Matt Wells 2013-11-15 14:47:05 -07:00
  • e04deb82d1 log when url matches page process pattern and which pattern it matches. Matt Wells 2013-11-15 13:11:05 -08:00
  • c9af8adf6e when someone deletes/resets a coll we clear the lock table and the msg12 handler can not confirm a lock request since the table was cleared out! so do not core!! Matt Wells 2013-11-15 12:37:27 -08:00
  • 6495dfd86e try to fix json parser overflow error. needs testing. tried to fix round num from incrementing for little job because i think server overload. should be fixed right some time. just made wait time 30 secs instead of 10 in Spider.cpp. Matt Wells 2013-11-15 11:30:16 -08:00
  • 6b9d0656ff Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-15 09:36:09 -08:00
  • 3258360679 do not do array break up substitution logic on diffbot replies if not of type product or image. it was breaking up the images array WITHIN an article type. Matt Wells 2013-11-15 09:35:32 -08:00
  • bf75ac6a0d fix page process pattern parsing Matt Wells 2013-11-15 09:34:47 -08:00
  • 3563f0643f fix little bug of using product not image Matt Wells 2013-11-15 09:13:27 -08:00
  • fe1a7d1a75 rdbbase not fully resetting? it was trying to dump to coll directories that had been moved to trash folder. and printing out "deleted from under us". at least it was corrupting data in RdbMem this time because i added m_dumpErrno logic. Matt Wells 2013-11-15 09:01:58 -08:00
  • 9ed40a1112 hacky hacks Matt Wells 2013-11-14 16:59:50 -08:00
  • bb964ac214 fix core Matt Wells 2013-11-14 16:28:23 -08:00
  • b0e40ae68b fix bad json bug Matt Wells 2013-11-14 15:05:15 -08:00
  • 1518778405 fix for bad json splicing Matt Wells 2013-11-14 14:42:31 -08:00
  • 7fc8b6a005 fix oopsy Matt Wells 2013-11-14 14:09:05 -08:00
  • 7c84b6ee0b show restart crawl button Matt Wells 2013-11-14 14:07:45 -08:00
  • 62432b3530 support for &restart=1 Matt Wells 2013-11-14 14:02:56 -08:00
  • 3033684a8d fix for json parsing. added restart=1 support Matt Wells 2013-11-14 13:16:08 -08:00
  • 9059aa8a01 fix link Matt Wells 2013-11-14 12:53:49 -08:00
  • be213ca28f now fix embedded products and images in the diffbot json reply properly! Matt Wells 2013-11-14 12:51:34 -08:00
  • 28cd1e6490 you can submit action then expression now. Matt Wells 2013-11-14 09:54:36 -08:00
  • 8534914902 fix core when xmldoc::getmsg20reply is called Matt Wells 2013-11-14 09:32:18 -08:00
  • 5c0194c439 fix json validation bug Matt Wells 2013-11-13 19:29:33 -08:00
  • eb719849a6 do not core on this dump error Matt Wells 2013-11-13 19:04:22 -08:00
  • da013d1b18 fix invalid json bug of not ending json items in images/products array Matt Wells 2013-11-13 18:44:15 -08:00
  • 45cc9bb112 fix a few nasty bugs Matt Wells 2013-11-13 18:31:26 -08:00
  • a5c3b3b8f8 fix so spider does not say it is done crawling right after you seed it! Matt Wells 2013-11-13 16:03:15 -08:00
  • 7020f66daa bulk api nominal updates Matt Wells 2013-11-13 14:30:51 -08:00
  • 9e77f1b2f6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-13 13:27:45 -08:00
  • a31b13ad61 fix a few bugs. Matt Wells 2013-11-13 13:27:22 -08:00
  • 42b5638680 added log msg mwells 2013-11-13 00:57:59 -07:00
  • 6cc4e6d980 added some more links to my gui Matt Wells 2013-11-12 17:05:13 -08:00
  • 7f038235e1 hack in a type:product or type:image since product and image json elements are taken from an array and lack those. Matt Wells 2013-11-12 16:57:14 -08:00
  • df28c4e0c2 search results in csv format. remove serps per page limit if custom crawl. Matt Wells 2013-11-12 16:33:45 -08:00
  • 38c8bec024 use gbspiderdate not spiderdate. so gotta use gbsortby:gbspiderdate etc. Matt Wells 2013-11-12 13:55:47 -08:00
  • fbcd6b8afd display json objects that are not in arrays in csv. show csv header. how to deal with heterogenous object lists? index spiderdate: for gbsortby:spiderdate. added gbrevsortby: support. Matt Wells 2013-11-12 13:51:52 -08:00
  • 364216ff16 fixed bugs in sort by prices, etc. Matt Wells 2013-11-11 18:58:45 -08:00
  • 4548098809 a couple more nominal updates Matt Wells 2013-11-11 16:10:47 -08:00
  • ad61e9ea5a /v2/bulk api updates. Matt Wells 2013-11-11 15:52:04 -08:00
  • 7248641bc4 fix mem leaks. turn off electric fence. Matt Wells 2013-11-11 09:58:14 -08:00
  • 7efb743e65 nothing Matt Wells 2013-11-10 22:25:19 -08:00
  • 5aa1609350 Merge branch 'master' into diffbot Matt Wells 2013-11-10 22:11:39 -08:00
  • af678b7c1b fix a few bugs. Matt Wells 2013-11-10 22:11:13 -08:00
  • 105a201cde fix mem leak. check if tree writes are disabled and block until not when deleting/resetting a collection. just like we do if tree is being saved. Matt Wells 2013-11-10 16:28:00 -08:00
  • 810a6918fd Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-10 09:41:19 -08:00
  • 3afac4812d fix bug of trying to del/reset coll while disable writing was engaged. we already had it check to see if tree was saving, but not if writes were disabled. so it gets ETRYAGAIN and retries later. Matt Wells 2013-11-10 09:40:32 -08:00
  • b1e98aa4b8 fix core. Matt Wells 2013-11-08 21:33:37 -07:00
  • e395628d5a use &format=0 1 or 2 for html/xml/json now. use &icc=1 to get dump of json objects in serps. Matt Wells 2013-11-08 18:00:30 -08:00
  • aa9a77674f fixed oopsy when parsing float words Matt Wells 2013-11-08 16:25:23 -08:00
  • 09f28b2f26 now we index all numbers that have field names (so can't just be a number in the body) but it can be in a meta tag or json item. then use like gbsortby:products.offerPrice to sort the search results (json objects) by that. Matt Wells 2013-11-08 16:16:13 -08:00
  • 9895ad093f fix that pesky spider start time bug. Matt Wells 2013-11-07 16:43:02 -08:00
  • a76f4c6974 just POST a full request for webhook now so we can do application/json content type Matt Wells 2013-11-07 14:20:15 -08:00
  • ab9a3b1798 put download links to api output for a crawl Matt Wells 2013-11-07 14:07:38 -08:00
  • 3e4db4f1bc show all crawl details in url webhook notification in the post body. Matt Wells 2013-11-07 13:59:43 -08:00
  • 2ae04cff71 return crawl delete reply in json. take out EDOCEVILREDIRECT errors. Matt Wells 2013-11-07 09:55:47 -08:00
  • 3b929917d1 do not site cluster or do dup removal in crawlbot search results Matt Wells 2013-11-07 09:40:31 -08:00
  • 396a88799a fix bad bug of basically emptying out all our data on auto-save! Matt Wells 2013-11-06 19:49:20 -08:00
  • 73de13b2da fix core. Matt Wells 2013-11-06 17:15:29 -08:00
  • 0655160c26 fixed quite a few nasty bugs. collectionrec neg/pos key counting overruns. Matt Wells 2013-11-06 15:44:50 -08:00
  • afb5a2be64 Merge branch 'master' into diffbot Matt Wells 2013-11-06 10:18:04 -08:00
  • 6c2604b0df re-do spider fix Matt Wells 2013-11-06 10:17:13 -08:00
  • bf74eba667 Revert "fix spider "could launch" setting because" Matt Wells 2013-11-06 10:16:46 -08:00
  • 22a153cbe3 fix spider "could launch" setting because if we have all spiders out and doing a tcp timed out then the round would end! Matt Wells 2013-11-06 10:15:35 -08:00
  • 34ce22fe19 use timeout bash cmd to prevent ppthtml, etc. hangs Matt Wells 2013-11-06 11:12:00 -07:00
  • 5cbcba55fe take out ppthtml call because it is too buggy until we get a ulimit replacement to work. Matt Wells 2013-11-06 10:51:47 -07:00
  • 1d03cd740d debug comment Matt Wells 2013-11-05 14:36:23 -08:00
  • 263bb8dfbc fix oops Matt Wells 2013-11-05 14:32:56 -08:00
  • 2b904e9563 include firstip in the spider url lock, not just uh48, because using fake ips results in having the same url crawled twice since it is from a different "firstip" so we should include "firstip" in the lock as well to prevent a double round increment. see comment in Spider.cpp to this effect. Matt Wells 2013-11-05 14:31:05 -08:00
  • f0adb26fdc remove expired locks more often. was causing stuff not to get spidered. Matt Wells 2013-11-05 13:09:56 -08:00
  • 2c7035ac2b do not truncate diffbot reply Matt Wells 2013-11-05 11:17:54 -08:00
  • fbc743ad5f fixed core dump when host does not have /etc/hostname file present. Matt Wells 2013-11-05 10:13:25 -07:00
  • d5c86f720d Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-11-05 09:33:47 -07:00
  • 5a5973a47f privacy.html update Matt Wells 2013-11-05 09:33:42 -07:00
  • 0baf1e68c1 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-11-04 22:03:15 -07:00
  • 8fe64c165c fix potential core dump Matt Wells 2013-11-04 22:03:03 -07:00
  • 9335efbf00 fix bug of jenkins spidering same url at same time in different colls Matt Wells 2013-11-04 17:08:11 -08:00
  • 74cd3fe0a1 fix spider status stuff. Matt Wells 2013-11-04 16:35:58 -08:00
  • 34bffc2cc6 1-second crawl info sleep wrapper update Matt Wells 2013-11-04 16:02:03 -08:00
  • ca94750d72 global crawl info realtime updates on local host Matt Wells 2013-11-04 15:07:53 -08:00
  • 6f4ce06001 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-04 14:41:52 -08:00
  • 2a503095d4 nothing Matt Wells 2013-11-04 14:41:36 -08:00
  • 2b9308daef fix deduping. Matt Wells 2013-11-04 14:41:09 -08:00
  • 9e955f73a3 fix onlyProcessIfNew parm. fixed gbcontenthash: stuff Matt Wells 2013-11-04 13:57:44 -08:00
  • 8c9d5d824b support for gbcontenthash:xxxxx for doing exact match deduping. highest site rank page wins, on ties, lowest docid wins. Matt Wells 2013-11-04 13:47:13 -08:00
  • d78413a6c0 quick json validation fix Matt Wells 2013-11-04 11:34:22 -08:00
  • d22d2f560e fix json object dump. valid json. Matt Wells 2013-11-04 11:29:22 -08:00
  • 9150e8ed50 just show json for specified "name" Matt Wells 2013-11-04 11:05:10 -08:00
  • 7b319e5948 show more info in the urls csv file. record whether we processed the url or not in the SpiderReply. normalize /index.html etc. to / for the outlinks. in Links.cpp class. Matt Wells 2013-11-04 10:49:31 -08:00
  • c13cce9d72 fix for proxy core mwells 2013-11-03 22:43:44 -07:00